Re: Hi, guys, does anyone use Spark in finance market?

2016-09-01 Thread Adam Roberts
Hi, yes, there's definitely a market for Apache Spark in financial 
institutions. I can't provide specific details, but to answer your survey: 
"yes" and "more than a few GB!"

Here are a couple of examples showing Spark with financial data (full 
disclosure: I work for IBM); I'm sure there are lots more examples you 
can find too:

https://www.youtube.com/watch?v=VWBNoIwGEjo shows how Spark can be used 
with simple sentiment analysis to figure out correlations between real 
world events and stock market changes. The Spark-specific part runs from 
3:04 until the 8th minute.
https://www.youtube.com/watch?v=sDmWcuO5Rk8 is a similar example where 
Spark is again used with sentiment analysis. One could also analyse 
financial data to identify trends; I think a lot of the machine learning 
APIs will be useful here, e.g. logistic regression with many features could 
be used to decide whether or not an investment is a good idea based on 
training data (so we'd look at real outcomes from previous speculations).
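
To make that last idea concrete, here's a minimal sketch of how the spark.ml APIs could be used to train such a classifier. The file names and feature columns (pe_ratio, momentum_30d, sector_sentiment) are hypothetical placeholders, not anything from the talks:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object InvestmentClassifier {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("InvestmentClassifier").master("local[*]").getOrCreate()

    // Hypothetical training data: one row per past speculation, with the real
    // outcome recorded in a numeric "label" column (1 = good investment, 0 = bad).
    val training = spark.read.option("header", "true").option("inferSchema", "true")
      .csv("past_speculations.csv")

    // Collect the numeric feature columns into the single vector column that
    // spark.ml estimators expect.
    val assembler = new VectorAssembler()
      .setInputCols(Array("pe_ratio", "momentum_30d", "sector_sentiment"))
      .setOutputCol("features")

    val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
    val model = lr.fit(assembler.transform(training))

    // Score new candidate investments with the same feature pipeline.
    val candidates = spark.read.option("header", "true").option("inferSchema", "true")
      .csv("candidates.csv")
    model.transform(assembler.transform(candidates))
      .select("features", "probability", "prediction")
      .show()

    spark.stop()
  }
}

In practice you'd wrap the assembler and estimator in a Pipeline and hold back a test set to check the model before trusting it with real money.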

In both cases you can see Spark is a very important component for 
performing the calculations with financial data.
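
For the correlation idea in the first video, a minimal sketch (again with hypothetical file and column names, not anything shown in the talk) could be as simple as joining a daily sentiment score against daily index moves and asking Spark for the Pearson correlation:

import org.apache.spark.sql.SparkSession

object SentimentCorrelation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SentimentCorrelation").master("local[*]").getOrCreate()

    // Hypothetical inputs: one table of daily sentiment scores, one of daily index changes.
    val sentiment = spark.read.option("header", "true").option("inferSchema", "true")
      .csv("daily_sentiment.csv")      // columns: date, sentiment_score
    val market = spark.read.option("header", "true").option("inferSchema", "true")
      .csv("daily_index_change.csv")   // columns: date, pct_change

    // Join on date and compute the Pearson correlation between the two series.
    val corr = sentiment.join(market, "date").stat.corr("sentiment_score", "pct_change")
    println(s"Correlation between daily sentiment and index change: $corr")

    spark.stop()
  }
}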

I also know that Goldman Sachs mentioned they are interested in Spark, one 
talk is at https://www.youtube.com/watch?v=HWwAoTK2YrQ, so this is more 
evidence of the financial industry paying attention to big data and 
Spark.

Regarding your app: I expect it to be similar to the first example, where 
the signals you mention are real-world events (e.g. the Fed lowers 
interest rates, or companies are rumoured to be about to float or be 
acquired). 

At the 4:30 mark I think you actually identify previous index values and 
extrapolate what they are likely to become, so in theory your system 
would become more accurate over time - although would going off indexes 
alone be sufficient (if indeed this is what you're doing)? 

I think you'd want to combine this with real world speculation/news to 
figure out *why* the price is likely to change, how much by and in which 
direction.
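
As a rough sketch of what combining the two could look like (hypothetical file and column names again, not a claim about how your app works): build lagged index values as features, add a daily news/sentiment score alongside them, and fit a regression so the model sees both where the index has been and a hint as to why it might move:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

object IndexWithNews {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("IndexWithNews").master("local[*]").getOrCreate()

    // Hypothetical input: one row per trading day with the index close and a
    // precomputed daily news/sentiment score.
    val daily = spark.read.option("header", "true").option("inferSchema", "true")
      .csv("daily_index_and_news.csv")   // columns: date, close, news_score

    // Previous index values become the features the model extrapolates from.
    val byDate = Window.orderBy("date")
    val withLags = daily
      .withColumn("close_lag1", lag("close", 1).over(byDate))
      .withColumn("close_lag2", lag("close", 2).over(byDate))
      .na.drop()

    // Lagged prices plus the news signal go into one feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("close_lag1", "close_lag2", "news_score"))
      .setOutputCol("features")

    val lr = new LinearRegression().setLabelCol("close").setFeaturesCol("features")
    val model = lr.fit(assembler.transform(withLags))
    println(s"Coefficients: ${model.coefficients}, intercept: ${model.intercept}")

    spark.stop()
  }
}

The size of the coefficient on news_score relative to the lag terms gives a crude first read on whether the news signal adds anything beyond the price history alone.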

I agree that Apache Spark can be just the right tool for doing the heavy 
lifting required for the analysis, computation and modelling of big data, 
so I'm looking forward to future Spark work in this area, and I wonder how 
we as Spark developers can make it easier and more powerful for Spark users 
to do so.




From:   "Taotao.Li" 
To: user 
Date:   30/08/2016 14:14
Subject:Hi, guys, does anyone use Spark in finance market?




Hi, guys,

I'm a quant engineer in China, and I believe using Spark in the financial 
market is very promising. But I haven't found cases which combine Spark 
and finance.

So here I want to do a small survey: 

do you guys use Spark in a financial-market-related project?
if yes, how much data was fed into your Spark application?

 thanks a lot.

___
A little ad: I attended the IBM Spark Hackathon, which is here: 
http://apachespark.devpost.com/ , and I submitted a small application 
which will be used in my strategies. Hope you guys can give me a vote and 
some suggestions on how to use Spark in the financial market to discover 
trade opportunities.

here is my small app: 
http://devpost.com/software/spark-in-finance-quantitative-investing

thanks a lot.


-- 
___
Quant | Engineer | Boy
___
blog: http://litaotao.github.io
github: www.github.com/litaotao

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU



Re: Spark 2.0.0 - Java vs Scala performance difference

2016-09-01 Thread Adam Roberts
On Java vs Scala: Sean's right that behind the scenes you'll be calling 
JVM-based APIs anyway (e.g. sun.misc.Unsafe for Tungsten) and that the 
vast majority of Apache Spark's important logic is written in Scala.

It would be an interesting experiment to write the same functioning program 
using the Java APIs vs the Scala APIs, just to see if there is a noticeable 
difference. I'm thinking in terms of how the Scala implementation 
libraries perform at runtime: with profiling (we use Healthcenter, tprof, 
or just microbenchmarking with prints and timers) we've seen lots of code 
in Scala itself to do with (un)boxing and instanceOf checks that could do 
with some TLC for performance.
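
As an aside, this is the sort of quick-and-dirty prints-and-timers microbenchmark I mean - plain Scala, nothing Spark-specific, and certainly not a rigorous benchmark - just making the cost of boxing visible by summing the same million ints through an unboxed array and a boxed collection:

object BoxingBench {
  // Time a block and print roughly how long it took.
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(f"$label%-26s ${(System.nanoTime() - start) / 1e6}%.2f ms")
    result
  }

  def sumArray(xs: Array[Int]): Long = {
    var i = 0; var s = 0L
    while (i < xs.length) { s += xs(i); i += 1 }
    s
  }

  def main(args: Array[String]): Unit = {
    val primitive = Array.tabulate(1000000)(identity)   // unboxed ints
    val boxed = Vector.tabulate(1000000)(identity)      // each element is boxed

    // Warm the JIT up before timing, otherwise compilation dominates the numbers.
    (1 to 5).foreach { _ => sumArray(primitive); boxed.foldLeft(0L)(_ + _) }

    time("while over Array[Int]") { sumArray(primitive) }
    time("foldLeft over Vector[Int]") { boxed.foldLeft(0L)(_ + _) }
  }
}

The numbers vary wildly between JVMs and runs, which is exactly why proper profiling is worth the effort for anything serious.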

Now quite outdated but still shows that writing what's concise (Scala) 
isn't always best for performance: 
https://jazzy.id.au/2012/10/16/benchmarking_scala_against_java.html

So if we just stick to Java we may not hit those overheads as often 
(there's a talk by my colleague on boosting performance from a Java 
implementer's perspective at https://www.youtube.com/watch?v=rcVTM-71bZk), 
but I don't expect the differences to be enormous. Full disclosure that I 
work for IBM and one of our goals is to improve Apache Spark and our Java 
implementation to perform fast together.

There's also the obvious trade-off of developer productivity and code 
maintainability (more Java devs than Scala devs), so my suggestion is that 
if you're much better at writing Java or Scala code, use that for the 
really important, performance-critical logic - but be aware that you're 
going to be hitting the Apache Spark codebase written in Scala anyway, 
so there's only so much to be gained here.

I also think that Just-In-Time compiler implementations are generally 
better at optimising what's written as Java code instead of Scala code, as 
knowing the types way ahead of time, and where we can make codepath 
shortcuts in the bytecode execution, should deliver a slight performance 
improvement. I am keen to come up with some solid recommendations based 
on evidence for us all to benefit from.




From:   Aseem Bansal 
To: ayan guha 
Cc: Sean Owen , user 
Date:   01/09/2016 13:11
Subject:Re: Spark 2.0.0 - Java vs Scala performance difference



there is already a mail thread for scala vs python. check the archives

On Thu, Sep 1, 2016 at 5:18 PM, ayan guha  wrote:
How about Scala vs Python?

On Thu, Sep 1, 2016 at 7:27 PM, Sean Owen  wrote:
I can't think of a situation where it would be materially different.
Both are using the JVM-based APIs directly. Here and there there's a
tiny bit of overhead in using the Java APIs because something is
translated from a Java-style object to a Scala-style object, but this
is generally trivial.

On Thu, Sep 1, 2016 at 10:06 AM, Aseem Bansal  
wrote:
> Hi
>
> Would there be any significant performance difference when using Java 
vs.
> Scala API?





-- 
Best Regards,
Ayan Guha




Re: Spark build 1.6.2 error

2016-08-31 Thread Adam Roberts
Looks familiar: got the zinc server running and using a shared dev box?

ps -ef | grep "com.typesafe.zinc.Nailgun" - look for the zinc server 
process, kill it and try again. Spark branch-1.6 builds great here from 
scratch; I had plenty of problems thanks to running the zinc server here 
(started with build/mvn).




From:   Nachiketa 
To: Diwakar Dhanuskodi 
Cc: user 
Date:   31/08/2016 12:17
Subject:Re: Spark build 1.6.2 error



Hi Diwakar,

Could you please share the entire maven command that you are using to 
build ? And also the JDK version you are using ?

Also, could you please confirm that you did execute the script to change 
the Scala version to 2.11 before starting the build? Thanks.

Regards,
Nachiketa

On Wed, Aug 31, 2016 at 2:00 AM, Diwakar Dhanuskodi <
diwakar.dhanusk...@gmail.com> wrote:
Hi, 

While building Spark 1.6.2, I'm getting the below error in spark-sql. 
Would much appreciate any help.

[ERROR] missing or invalid dependency detected while loading class file 
'WebUI.class'.
Could not access term eclipse in package org,
because it (or its dependencies) are missing. Check your build definition 
for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see 
the problematic classpath.)
A full rebuild may help if 'WebUI.class' was compiled against an 
incompatible version of org.
[ERROR] missing or invalid dependency detected while loading class file 
'WebUI.class'.
Could not access term jetty in value org.eclipse,
because it (or its dependencies) are missing. Check your build definition 
for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see 
the problematic classpath.)
A full rebuild may help if 'WebUI.class' was compiled against an 
incompatible version of org.eclipse.
[WARNING] 17 warnings found
[ERROR] two errors found
[INFO] 

[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM .. SUCCESS [4.399s]
[INFO] Spark Project Test Tags ... SUCCESS [3.443s]
[INFO] Spark Project Launcher  SUCCESS [10.131s]
[INFO] Spark Project Networking .. SUCCESS [11.849s]
[INFO] Spark Project Shuffle Streaming Service ... SUCCESS [6.641s]
[INFO] Spark Project Unsafe .. SUCCESS [19.765s]
[INFO] Spark Project Core  SUCCESS [4:16.511s]
[INFO] Spark Project Bagel ... SUCCESS [13.401s]
[INFO] Spark Project GraphX .. SUCCESS [1:08.824s]
[INFO] Spark Project Streaming ... SUCCESS [2:18.844s]
[INFO] Spark Project Catalyst  SUCCESS [2:43.695s]
[INFO] Spark Project SQL . FAILURE [1:01.762s]
[INFO] Spark Project ML Library .. SKIPPED
[INFO] Spark Project Tools ... SKIPPED
[INFO] Spark Project Hive  SKIPPED
[INFO] Spark Project Docker Integration Tests  SKIPPED
[INFO] Spark Project REPL  SKIPPED
[INFO] Spark Project YARN Shuffle Service  SKIPPED
[INFO] Spark Project YARN  SKIPPED
[INFO] Spark Project Assembly  SKIPPED
[INFO] Spark Project External Twitter  SKIPPED
[INFO] Spark Project External Flume Sink . SKIPPED
[INFO] Spark Project External Flume .. SKIPPED
[INFO] Spark Project External Flume Assembly . SKIPPED
[INFO] Spark Project External MQTT ... SKIPPED
[INFO] Spark Project External MQTT Assembly .. SKIPPED
[INFO] Spark Project External ZeroMQ . SKIPPED
[INFO] Spark Project External Kafka .. SKIPPED
[INFO] Spark Project Examples  SKIPPED
[INFO] Spark Project External Kafka Assembly . SKIPPED
[INFO] 

[INFO] BUILD FAILURE
[INFO] 

[INFO] Total time: 12:40.525s
[INFO] Finished at: Wed Aug 31 01:56:50 IST 2016
[INFO] Final Memory: 71M/830M
[INFO] 

[ERROR] Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) 
on project spark-sql_2.11: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed 
-> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the 
-e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, 
please read the following articles:
[ERROR]

Re: Memory allocation error with Spark 1.5, HashJoinCompatibilitySuite

2015-08-24 Thread Adam Roberts
Hi, I'm regularly hitting "Unable to acquire memory" problems, but only when
trying to use overflow pages and only when running the full set of Spark tests
across different platforms. The machines I'm using all have well over 10 GB of
RAM and I'm running without any changes to the pom.xml file. Standard 3 GB Java
heap specified.

I'm working off this revision:

commit 43e0135421b2262cbb0e06aae53523f663b4f959
Author: Yin Huai 
Date:   Thu Aug 20 15:30:31 2015 +0800

[SPARK-10092] [SQL] Multi-DB support follow up.

https://issues.apache.org/jira/browse/SPARK-10092

This pr is a follow-up one for Multi-DB support. It has the following
changes:

* `HiveContext.refreshTable` now accepts `dbName.tableName`.

I've added prints in a variety of places; when we run just the one suite we
don't hit the problem, but with the whole batch of tests, we do.

Example below, note that it's always in the join31 test.

cat CheckHashJoinFullBatch.txt | grep -C 10 "join31"
- auto_join30
Creating unsafe external sorter, pageSizeBytes: 4194304
acquiring 4194304 from shuffle memory manager
memoryAcquired is: 4194304
Creating unsafe external sorter, pageSizeBytes: 4194304
acquiring 4194304 from shuffle memory manager
memoryAcquired is: 4194304
Creating unsafe external sorter, pageSizeBytes: 4194304
acquiring 4194304 from shuffle memory manager
memoryAcquired is: 4194304
- auto_join31
- auto_join32
- auto_join4
- auto_join5
- auto_join6
- auto_join7
- auto_join8
- auto_join9
04:53:44.685 WARN
org.apache.spark.sql.hive.execution.HashJoinCompatibilitySuite:
Simplifications made on unsupported operations for test auto_join_filters
- auto_join_filters
- auto_join_nulls
--
05:08:18.329 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 43.0
in stage 2993.0 (TID 130982, localhost): TaskKilled (killed intentionally)
05:08:18.330 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 40.0
in stage 2993.0 (TID 130979, localhost): TaskKilled (killed intentionally)
05:08:18.340 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 15.0
in stage 2993.0 (TID 130954, localhost): TaskKilled (killed intentionally)
05:08:18.341 ERROR org.apache.spark.executor.Executor: Managed memory leak
detected; size = 12582912 bytes, TID = 130985
05:08:18.341 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 46.0
in stage 2993.0 (TID 130985, localhost): TaskKilled (killed intentionally)
05:08:18.343 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 41.0
in stage 2993.0 (TID 130980, localhost): TaskKilled (killed intentionally)
05:08:18.343 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 26.0
in stage 2993.0 (TID 130965, localhost): TaskKilled (killed intentionally)
05:08:18.345 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 4.0
in stage 2993.0 (TID 130943, localhost): TaskKilled (killed intentionally)
05:08:18.345 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 11.0
in stage 2993.0 (TID 130950, localhost): TaskKilled (killed intentionally)
05:08:18.349 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 28.0
in stage 2993.0 (TID 130967, localhost): TaskKilled (killed intentionally)
- join31 *** FAILED ***
  Failed to execute query using catalyst:
  Error: Job aborted due to stage failure: Task 42 in stage 2993.0 failed 1
times, most recent failure: Lost task 42.0 in stage 2993.0 (TID 130981,
localhost): java.io.IOException: Unable to acquire 4194304 bytes of memory
at
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:371)
at
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:350)
at
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:489)
at
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:138)
at
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:477)
at
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:368)
at
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:610)
at
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)

I run the test on its own with:
mvn -Pyarn -Phadoop-2.2 -Phive -Phive-thriftserver
-DwildcardSuites=org.apache.spark.sql.hive.execution.HashJoinCompatibilitySuite
-fn test > CheckHashJoin.txt 2>&1

I run the whole batch with
mvn -Pyarn -Phadoop-2.2 -Phive -Phive-thriftserver -fn test >
CheckHashJoinFullBatch.txt 2>&1

java version "1.7.0_65"
OpenJDK Runtime Environment (rhel-2.5.1.2.el7_0-x86_64 u65-b17)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed