Re: For performance, Spark prefers OracleJDK or OpenJDK?

2014-05-18 Thread Andrew Ash
I don't know if anyone has done benchmarking of different JVMs for Spark specifically, but the widely-held belief seems to be that the Oracle JDK is slightly more performant. Elasticsearch makes heavy use of Lucene, which is a particularly intense workout for a JVM, and they recommend using the

Re: For performance, Spark prefers OracleJDK or OpenJDK?

2014-05-18 Thread Gordon Wang
I would like to say that the Oracle JDK may be the better choice. A lot of Hadoop distribution vendors use the Oracle JDK instead of OpenJDK for enterprise deployments. On Mon, May 19, 2014 at 2:50 PM, Hao Wang wrote: > Hi, > > Oracle JDK and OpenJDK, which one is better or preferred for Spark? > > > Regards, > W

For performance, Spark prefers OracleJDK or OpenJDK?

2014-05-18 Thread Hao Wang
Hi, Oracle JDK and OpenJDK, which one is better or preferred for Spark? Regards, Wang Hao(王灏) CloudTeam | School of Software Engineering Shanghai Jiao Tong University Address:800 Dongchuan Road, Minhang District, Shanghai, 200240 Email:wh.s...@gmail.com

Re: sync master with slaves with bittorrent?

2014-05-18 Thread Andrew Ash
If the codebase for Spark's broadcast is pretty self-contained, you could consider creating a small bootstrap sent out via the doubling rsync strategy that Mosharaf outlined above (called "Tree D=2" in the paper) that then pulled the larger Mosharaf, do you have a sense of whether the gains from u
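The appeal of the doubling strategy is easy to see with a toy model (a hypothetical helper, not from the thread): each round, every host that already has the data rsyncs to one host that doesn't, so coverage doubles per round.

```python
def doubling_rounds(n_slaves):
    """Rounds needed under the 'Tree D=2' strategy: every host that
    already holds the data sends to one new host per round, so the
    number of covered hosts doubles each round (master is host 1)."""
    have, rounds = 1, 0
    while have < n_slaves + 1:
        have *= 2
        rounds += 1
    return rounds

# Sequential rsync from the master needs n_slaves rounds;
# the doubling tree needs only ceil(log2(n_slaves + 1)).
```

For a 100-slave cluster that is 7 rounds instead of 100, which is why tree- or torrent-style distribution matters at ec2 scale.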

Re: sync master with slaves with bittorrent?

2014-05-18 Thread Aaron Davidson
One issue with using Spark itself is that this rsync is required to get Spark to work... Also note that a similar strategy is used for *updating* the spark cluster on ec2, where the "diff" aspect is much more important, as you might only make a small change on the driver node (recompile or reconfi

Re: problem with hdfs access in spark job

2014-05-18 Thread Marcin Cylke
On Thu, 15 May 2014 09:44:35 -0700 Marcelo Vanzin wrote: > These are actually not worrisome; that's just the HDFS client doing > its own thing to support HA. It probably picked the "wrong" NN to try > first, and got the "NN in standby" exception, which it logs. Then it > tries the other NN and th

Re: sync master with slaves with bittorrent?

2014-05-18 Thread Mosharaf Chowdhury
What Twitter calls murder, unless it has changed since then, is just a BitTornado wrapper. In 2011, we compared the performance of murder and the TorrentBroadcast we have right now for Spark's own broadcast (Section 7.1 in http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11

Re: sync master with slaves with bittorrent?

2014-05-18 Thread Andrew Ash
My first thought would be to use libtorrent for this setup, and it turns out that both Twitter and Facebook do code deploys with a bittorrent setup. Twitter even released their code as open source: https://blog.twitter.com/2010/murder-fast-datacenter-code-deploys-using-bittorrent http://arstechn

Re: Using mongo with PySpark

2014-05-18 Thread Samarth Mailinglist
db = MongoClient()['spark_test_db']
collec = db['programs']

def mapper(val):
    asc = val.encode('ascii', 'ignore')
    json = convertToJSON(asc, indexMap)
    collec.insert(json)  # this is not working

def convertToJSON(string, indexMap):
    values = string.strip().split(",")
    json = {}
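The insert likely fails because `collec` is created on the driver, and pymongo connections cannot be pickled into the closure that PySpark ships to workers. A minimal sketch of the usual fix (names are illustrative; assumes pymongo and a reachable MongoDB): create the client inside the task, once per partition.

```python
def save_partition(rows, client_factory):
    """Insert one partition of documents, creating the DB client
    inside the worker so the shipped closure stays picklable."""
    client = client_factory()            # e.g. lambda: MongoClient()
    coll = client['spark_test_db']['programs']
    n = 0
    for doc in rows:
        coll.insert(doc)                 # pymongo 2.x API, as in the thread
        n += 1
    return [n]                           # inserted count, for checking

# On the cluster this would be driven by something like:
#   rdd.mapPartitions(lambda it: save_partition(it, lambda: MongoClient())).sum()
```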

Re: sync master with slaves with bittorrent?

2014-05-18 Thread Daniel Mahler
I am not an expert in this space either. I thought the initial rsync during launch is really just a straight copy that did not need the tree diff. So it seemed like having the slaves do the copying among each other would be better than having the master copy to everyone directly. That made me th

Re: sync master with slaves with bittorrent?

2014-05-18 Thread Aaron Davidson
Out of curiosity, do you have a library in mind that would make it easy to set up a BitTorrent network and distribute files in an rsync (i.e., apply a diff to a tree, ideally) fashion? I'm not familiar with this space, but we do want to minimize the complexity of our standard ec2 launch scripts to

sync master with slaves with bittorrent?

2014-05-18 Thread Daniel Mahler
I am launching a rather large cluster on ec2. It seems like the launch is taking forever on Setting up spark RSYNC'ing /root/spark to slaves... ... It seems that bittorrent might be a faster way to replicate the sizeable spark directory to the slaves particularly if there is a lot of not very

unsubscribe

2014-05-18 Thread Venkat Krishnamurthy

Re: unsubscribe

2014-05-18 Thread Madhu
The volume on the list has grown to the point that individual emails can become excessive. That might be the reason for the increase in recent unsubscribes. You can subscribe to daily digests only using these addresses: Similar addresses exist for the digest list: The DEV list has an option

Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
To correct what I said above: ldconfig does work, it automatically makes a link: libopenblas.so.0 -> libopenblas_nehalemp-r0.2.9.rc2.so but what I need is libblas.so.3, so I tried several ways: 1. create a file called libblas.so.3, then ldconfig. 2. create a file called libblas.so.3.0 then ldconfig

Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
Thank you Xiangrui, I also think it may be a linking problem. I tried several ways: 1. export LD_LIBRARY_PATH=mypath 2. create the link file in /usr/lib lrwxrwxrwx 1 root root 34 May 19 00:38 libblas.so.3 -> libopenblas_nehalemp-r0.2.9.rc2.so 3. add mypath to /etc/ld.so.conf, then ldconfig

Re: breeze DGEMM slow in spark

2014-05-18 Thread Xiangrui Meng
The classpath seems to be correct. Where did you link libopenblas*.so to? The safest approach is to rename it to /usr/lib/libblas.so.3 and /usr/lib/liblapack.so.3 . This is the way I made it work. -Xiangrui On Sun, May 18, 2014 at 4:49 PM, wxhsdp wrote: > ok > > Spark Executor Command: "java" "-c
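Whether the renamed library actually resolves can be checked from Python before launching Spark (a small sketch using the standard library; library names as in the thread):

```python
import ctypes.util

def resolves(libname):
    """True if the dynamic linker can locate lib<name>, e.g. 'blas'
    for /usr/lib/libblas.so.3 after Xiangrui's rename."""
    return ctypes.util.find_library(libname) is not None

# e.g. print(resolves("blas"), resolves("lapack")) on the worker nodes
```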

Re: Passing runtime config to workers?

2014-05-18 Thread DB Tsai
When you reference any variable outside the executor's scope, Spark will automatically serialize it in the driver and send it to the executors, which implies those variables have to be serializable. For the example you mention, Spark will serialize object F, and if it's not serializab
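The constraint described here can be checked outside Spark: anything captured by the shipped function must survive serialization (pickle, on the PySpark side). A plain-Python sketch, with made-up classes standing in for user objects:

```python
import os
import pickle

class Settings:                 # plain data: serializes fine
    def __init__(self, factor):
        self.factor = factor

class NativeHandle:             # wraps an open file: cannot be pickled
    def __init__(self):
        self.f = open(os.devnull)

def shippable(obj):
    """True if obj could be captured in a closure Spark serializes."""
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False
```

In a real job the fix is to make the class Serializable (Scala) or to construct the unpicklable resource inside the task rather than capture it.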

unsubscribe

2014-05-18 Thread Terje Berg-Hansen
Andre Bois-Crettez skrev: >We never saw your exception when reading bzip2 files with spark. > >But when we wrongly compiled spark against older version of hadoop (was >default in spark), we ended up with sequential reading of bzip2 file, >not taking advantage of block splits to work in paralle

making spark/conf/spark-defaults.conf changes take effect

2014-05-18 Thread Daniel Mahler
I am running in an aws ec2 cluster that I launched using the spark-ec2 script that comes with Spark, and I use the "-v master" option to run the head version. If I then log into the master and make changes to spark/conf/spark-defaults.conf, how do I make the changes take effect across the cluster? Is j
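On the spark-ec2 image the usual answer is to push the edited file to every slave and restart the daemons; the AMI ships a copy-dir helper for this. A hypothetical minimal equivalent that just builds the rsync commands (host names and paths are illustrative):

```python
def rsync_commands(slaves, path="/root/spark/conf/"):
    """One rsync command per slave, mirroring the conf directory."""
    return [["rsync", "-az", path, "root@%s:%s" % (s, path)] for s in slaves]

# e.g. run each with subprocess.check_call(cmd), then restart the
# cluster daemons so the new settings are picked up.
```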

Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
ok Spark Executor Command: "java" "-cp" ":/root/ephemeral-hdfs/conf:/root/.ivy2/cache/org.scala-lang/scala-library/jars/scala-library-2.10.4.jar:/root/.ivy2/cache/org.scalanlp/breeze_2.10/jars/breeze_2.10-0.7.jar:/root/.ivy2/cache/org.scalanlp/breeze-macros_2.10/jars/breeze-macros_2.10-0.3.jar:/ro

First sample with Spark Streaming and three Time's?

2014-05-18 Thread Jacek Laskowski
Hi, I'm quite new to Spark Streaming and developed the following application to pass 4 strings, process them and shut down:
val conf = new SparkConf(false)  // skip loading external settings
  .setMaster("local[1]")         // run locally with one thread
  .setAppName("Spark Streaming with Sca

Spark Shell stuck on standalone mode

2014-05-18 Thread Sidharth Kashyap
Hi, I have configured a cluster with 10 slaves and one master. The master web portal shows all the slaves and looks to be correctly configured. I started the master node with the command MASTER=spark://:38955 $SPARK_HOME/bin/spark-shell This brings up the REPL with the following message though " 14/0

Re: Text file and shuffle

2014-05-18 Thread Han JU
I think the shuffle is unavoidable given that the input partitions (probably hadoop input splits in your case) are not arranged in the way of a cogroup job. But maybe you can try: 1) co-partition your data for cogroup:
val par = HashPartitioner(128)
val big = sc.textFile(..).map(...).part
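Why co-partitioning removes the shuffle can be seen with a toy partitioner in plain Python (names made up): if both datasets are split with the same hash function and partition count, every key's records already sit in matching partitions, so cogroup can proceed partition-by-partition.

```python
def hash_partition(pairs, num_parts):
    """Split (key, value) pairs into num_parts buckets by key hash,
    the way HashPartitioner(num_parts) would."""
    parts = [[] for _ in range(num_parts)]
    for k, v in pairs:
        parts[hash(k) % num_parts].append((k, v))
    return parts
```

Because the bucket index depends only on the key, a key present in both datasets always lands at the same index, and no records need to move.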

Re: unsubscribe

2014-05-18 Thread Andrew Ash
Hi Shangyu (and everyone else looking to unsubscribe!), If you'd like to get off this mailing list, please send an email to user-unsubscribe@spark.apache.org, not the regular user@spark.apache.org list. How to use the Apache mailing list infrastructure is documented here: https://www.apache.or

Re: IllegalAccessError when writing to HBase?

2014-05-18 Thread Nan Zhu
I tried hbase-0.96.2/0.98.1/0.98.2 HDFS version is 2.3 -- Nan Zhu On Sunday, May 18, 2014 at 4:18 PM, Nan Zhu wrote: > Hi, all > > I tried to write data to HBase in a Spark-1.0 rc8 application, > > the application is terminated due to the java.lang.IllegalAccessError, Hbase > shell work

IllegalAccessError when writing to HBase?

2014-05-18 Thread Nan Zhu
Hi, all I tried to write data to HBase in a Spark-1.0 rc8 application; the application is terminated due to a java.lang.IllegalAccessError. The HBase shell works fine, and the same application works with a standalone HBase deployment. java.lang.IllegalAccessError: com/google/protobuf/HBaseZero

unsubscribe

2014-05-18 Thread Shangyu Luo
Thanks!

Re: Passing runtime config to workers?

2014-05-18 Thread Robert James
I see - I didn't realize that scope would work like that. Are you saying that any variable that is in scope of the lambda passed to map will be automagically propagated to all workers? What if it's not explicitly referenced in the map, only used by it? E.g.: def main: settings.setSettings rd

Re: File list read into single RDD

2014-05-18 Thread Andrew Ash
Spark's sc.textFile() method delegates to sc.hadoopFile(), which uses Hadoop's FileInputFormat.setInputPaths()
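A practical consequence of that delegation (a sketch, assuming the standard behavior of FileInputFormat.setInputPaths): a comma-joined string of paths reads many files into a single RDD.

```python
def input_paths(paths):
    """Join paths the way FileInputFormat.setInputPaths accepts them:
    a single comma-separated string."""
    return ",".join(p.strip() for p in paths)

# e.g. rdd = sc.textFile(input_paths(["hdfs:///data/a.txt", "hdfs:///data/b.txt"]))
```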

Re: breeze DGEMM slow in spark

2014-05-18 Thread Xiangrui Meng
Can you attach the slave classpath? -Xiangrui On Sun, May 18, 2014 at 2:02 AM, wxhsdp wrote: > Hi, xiangrui > > you said "It doesn't work if you put the netlib-native jar inside an > assembly > jar. Try to mark it "provided" in the dependencies, and use --jars to > include them with spark-s

Re: File list read into single RDD

2014-05-18 Thread Pat Ferrel
Doesn’t using an HDFS path pattern then restrict the URI to an HDFS URI? Since Spark supports several FS schemes I’m unclear about how much to assume about using the hadoop file systems APIs and conventions. Concretely, if I pass a pattern in with an HTTPS file system, will the pattern work? How

Re: Configuring Spark for reduceByKey on massive data sets

2014-05-18 Thread Daniel Mahler
Hi Matei, Thanks for the suggestions. Is the number of partitions set by calling 'myrdd.partitionBy(new HashPartitioner(N))'? Is there some heuristic formula for choosing a good number of partitions? thanks Daniel On Sat, May 17, 2014 at 8:33 PM, Matei Zaharia wrote: > Make sure you set up e

Re: Configuring Spark for reduceByKey on massive data sets

2014-05-18 Thread lukas nalezenec
Hi Try using *reduceByKeyLocally*. Regards Lukas Nalezenec On Sun, May 18, 2014 at 3:33 AM, Matei Zaharia wrote: > Make sure you set up enough reduce partitions so you don’t overload them. > Another thing that may help is checking whether you’ve run out of local > disk space on the machines, and
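The trade-off behind this suggestion: reduceByKeyLocally returns a plain Map to the driver instead of an RDD, avoiding the shuffle but requiring the reduced result to fit in driver memory. A driver-side model of its semantics (pure Python, illustrative only):

```python
def reduce_by_key_locally(pairs, func):
    """Model of RDD.reduceByKeyLocally: fold values per key into a
    dict that lives on the driver (not a distributed RDD)."""
    out = {}
    for k, v in pairs:
        out[k] = func(out[k], v) if k in out else v
    return out
```

This is only appropriate when the number of distinct keys is small; for the massive data sets in this thread, increasing the number of reduce partitions is the safer first step.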

Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
Hi, xiangrui you said "It doesn't work if you put the netlib-native jar inside an assembly jar. Try to mark it "provided" in the dependencies, and use --jars to include them with spark-submit. -Xiangrui" I'm not using an assembly jar which contains everything; I also mark the breeze depende

Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
In case 1, the breeze dependency in the sbt build file automatically downloads the jars and adds them to the classpath. In the Spark case, I manually download all the jars and add them to the Spark classpath. Why did case 1 succeed and case 2 fail? Am I missing something?

Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
Hi, xiangrui I checked the stderr of the worker node; yes, it failed to load an implementation from: com.github.fommil.netlib.NativeSystemBLAS... What do you mean by "include breeze-natives or netlib:all"? Things I've already done: 1. add breeze and breeze-natives dependencies in the sbt build fil