Rank for SQL and ORDER BY?

2015-01-09 Thread Kevin Burton
I’m trying to do a simple graph sort in Spark, which I mostly have working. The one problem I have now is that I need to order the items and then assign a rank position. So the top item should have rank 0, the next one should have rank 1, etc. Hive and Pig support this with the RANK operator. I
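
A minimal sketch of the usual Spark approach (hedged; the element type and field names are illustrative, not from the original post): sort, then attach a 0-based rank with zipWithIndex:

  // items: RDD[(String, Double)] of (id, score) -- names illustrative
  val ranked = items
    .sortBy(_._2, ascending = false) // highest score first
    .zipWithIndex()                  // attach rank 0, 1, 2, ...
    .map { case ((id, score), rank) => (rank, id, score) }

zipWithIndex preserves the partition ordering produced by sortBy, so the attached index is a global rank.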

Re: Discrepancy in PCA values

2015-01-09 Thread Upul Bandara
Hi Xiangrui, Thanks for the reply. The Julia code is also using the covariance matrix: (1/n)*X'*X. Thanks, Upul On Fri, Jan 9, 2015 at 2:11 AM, Xiangrui Meng men...@gmail.com wrote: The Julia code is computing the SVD of the Gram matrix. PCA should be applied to the covariance matrix.

Re: Web Service + Spark

2015-01-09 Thread Corey Nolet
Cui Lin, The solution largely depends on how you want your services deployed (Java web container, Spray framework, etc...) and if you are using a cluster manager like Yarn or Mesos vs. just firing up your own executors and master. I recently worked on an example for deploying Spark services

RE: Implement customized Join for SparkSQL

2015-01-09 Thread Dai, Kevin
Hi Rishi, You are right. But there may be tens of thousands of ids, and B is a database with an index on id, which means querying by id is very fast. In fact we load A and B as separate SchemaRDDs as you suggested. But we hope we can extend the join implementation to achieve it in the parsing stage.

/tmp directory fills up

2015-01-09 Thread Alessandro Baretta
Gents, I'm building Spark using the current master branch and deploying it to Google Compute Engine on top of Hadoop 2.4/YARN via bdutil, Google's Hadoop cluster provisioning tool. bdutil configures Spark with spark.local.dir=/hadoop/spark/tmp, but this option is ignored in combination with

Re: DeepLearning and Spark ?

2015-01-09 Thread Peng Cheng
Not if broadcast can only be used between stages. To enable this you have to at least make broadcast asynchronous and non-blocking. On 9 January 2015 at 18:02, Krishna Sankar ksanka...@gmail.com wrote: I am also looking at this domain. We could potentially use the broadcast capability in Spark to

OOM exception during row deserialization

2015-01-09 Thread Pala M Muthaia
Hi, I am using Spark 1.0.1. I am trying to debug an OOM exception I saw during a join step. Basically, I have an RDD of rows that I am joining with another RDD of tuples. Some of the tasks succeed but a fair number failed with an OOM exception with the stack below. The stack belongs to the 'reducer'

Web Service + Spark

2015-01-09 Thread Cui Lin
Hello, All, What’s the best practice on deploying/publishing spark-based scientific applications into a web service? Similar to Shiny on R. Thanks! Best regards, Cui Lin

Re: OptionalDataException during Naive Bayes Training

2015-01-09 Thread Xiangrui Meng
How big is your data? Did you see other error messages from executors? It seems to me like a shuffle communication error. This thread may be relevant: http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3ccalrnvjuvtgae_ag1rqey_cod1nmrlfpesxgsb7g8r21h0bm...@mail.gmail.com%3E

How to use BigInteger for userId and productId in collaborative Filtering?

2015-01-09 Thread nishanthps
Hi, The userIds and productIds in my data are BigInts. What is the best way to run collaborative filtering on this data? Should I modify MLlib's implementation to support more types, or is there an easy way? Thanks!, Nishanth

Issue writing to Cassandra from Spark

2015-01-09 Thread Ankur Srivastava
Hi, We are currently using Spark to join data in Cassandra and then write the results back into Cassandra. While reads happen without any error, during writes we see many exceptions like the one below. Our environment details are: - Spark v 1.1.0 - spark-cassandra-connector-java_2.10 v 1.1.0 We are

Re: How to use BigInteger for userId and productId in collaborative Filtering?

2015-01-09 Thread Xiangrui Meng
Do you have more than 2 billion users/products? If not, you can pair each user/product id with an integer (check RDD.zipWithUniqueId), use them in ALS, and then join the original bigInt IDs back after training. -Xiangrui On Fri, Jan 9, 2015 at 5:12 PM, nishanthps nishant...@gmail.com wrote: Hi,
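
A hedged sketch of that remapping (the input layout and names are illustrative). zipWithUniqueId assigns Longs, which can be safely narrowed to Int as long as there are fewer than ~2 billion distinct ids:

  import org.apache.spark.SparkContext._
  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  // raw: RDD[(BigInt, BigInt, Double)] of (user, product, rating) -- illustrative
  val userIdMap = raw.map(_._1).distinct().zipWithUniqueId().mapValues(_.toInt)
  val prodIdMap = raw.map(_._2).distinct().zipWithUniqueId().mapValues(_.toInt)

  val ratings = raw.map { case (u, p, r) => (u, (p, r)) }.join(userIdMap)
    .map { case (_, ((p, r), ui)) => (p, (ui, r)) }.join(prodIdMap)
    .map { case (_, ((ui, r), pi)) => Rating(ui, pi, r) }

  val model = ALS.train(ratings, 10, 10) // rank = 10, iterations = 10 -- illustrative
  // join userIdMap / prodIdMap back onto the results to recover the BigInt ids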

Re: Web Service + Spark

2015-01-09 Thread gtinside
You can also look at Spark Job Server https://github.com/spark-jobserver/spark-jobserver - Gaurav On Jan 9, 2015, at 10:25 PM, Corey Nolet cjno...@gmail.com wrote: Cui Lin, The solution largely depends on how you want your services deployed (Java web container, Spray framework, etc...)

Re: Discrepancy in PCA values

2015-01-09 Thread Xiangrui Meng
You need to subtract mean values to obtain the covariance matrix (http://en.wikipedia.org/wiki/Covariance_matrix). On Fri, Jan 9, 2015 at 6:41 PM, Upul Bandara upulband...@gmail.com wrote: Hi Xiangrui, Thanks for the reply. Julia code is also using the covariance matrix: (1/n)*X'*X ;
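
A small sketch of the distinction (hedged; toy data): MLlib's RowMatrix.computeCovariance() centers the columns first, whereas (1/n)*X'*X without mean subtraction is just the scaled Gram matrix:

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  val rows = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
  val cov = new RowMatrix(rows).computeCovariance() // subtracts column means internally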

Re: Accidental kill in UI

2015-01-09 Thread Nicholas Chammas
As Sean said, this definitely sounds like something worth a JIRA issue (and PR). On Fri Jan 09 2015 at 8:17:34 AM Sean Owen so...@cloudera.com wrote: (FWIW yes I think this should certainly be a POST. The link can become a miniature form to achieve this and then the endpoint just needs to

Re: calculating the mean of SparseVector RDD

2015-01-09 Thread Xiangrui Meng
colStats() computes the mean values along with several other summary statistics, which makes it slower. How is the performance if you don't use Kryo? -Xiangrui On Fri, Jan 9, 2015 at 3:46 AM, Rok Roskar rokros...@gmail.com wrote: thanks for the suggestion -- however, looks like this is even

Submitting SparkContext and seeing driverPropsFetcher exception

2015-01-09 Thread Corey Nolet
I'm seeing this exception when creating a new SparkContext in YARN: [ERROR] AssociationError [akka.tcp://sparkdri...@coreys-mbp.home:58243] - [akka.tcp://driverpropsfetc...@coreys-mbp.home:58453]: Error [Shut down address: akka.tcp://driverpropsfetc...@coreys-mbp.home:58453] [

Re: Zipping RDDs of equal size not possible

2015-01-09 Thread Xiangrui Meng
"sample 2 * n tuples, split them into two parts, balance the sizes of these parts by filtering some tuples out" How do you guarantee that the two RDDs have the same size? -Xiangrui On Fri, Jan 9, 2015 at 3:40 AM, Niklas Wilcke 1wil...@informatik.uni-hamburg.de wrote: Hi Spark community, I have

Re: Re: EventBatch and SparkFlumeProtocol not found in spark codebase?

2015-01-09 Thread Todd
Thanks Sean. I followed the guide and imported the codebase into IntelliJ IDEA as a Maven project, with the profiles hadoop2.4 and yarn. In the Maven project view, I ran Maven Install against the module Spark Project Parent POM (root). After a pretty long time, all the modules were built successfully.

Re: Questions about Spark and HDFS co-location

2015-01-09 Thread Sean Owen
Spark uses MapReduce InputFormat implementations to read data from disk, so in that sense it has access to, and uses, the same locality info that things like MR do. Yes, tasks go to the data, and you want to run Spark on top of the HDFS DataNodes. (Locality isn't always the only priority that

Re: Questions about Spark and HDFS co-location

2015-01-09 Thread Ted Yu
I was looking for related information and found: http://spark-summit.org/wp-content/uploads/2013/10/Spark-Ops-Final.pptx See also http://hbase.apache.org/book.html#perf.hdfs.configs.localread for how short circuit read is enabled. Cheers On Fri, Jan 9, 2015 at 3:50 PM, Sean Owen

Re: Questions about Spark and HDFS co-location

2015-01-09 Thread Andrew Ash
Note also that early versions of short-circuit reads were actually net-negative for performance. Only after a second Hadoop release of the feature did it become a positive change. See earlier threads on this mailing list where short-circuit reads are discussed. On Fri, Jan 9, 2015 at

DeepLearning and Spark ?

2015-01-09 Thread Jaonary Rabarisoa
Hi all, Deep learning algorithms are popular and achieve state-of-the-art performance on several real-world machine learning problems. Currently there is no DL implementation in Spark and I wonder if there is ongoing work on this topic. We can do DL in Spark with Sparkling Water and H2O but

Re: DeepLearning and Spark ?

2015-01-09 Thread Marco Shaw
Pretty vague on details: http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A227199 On Jan 9, 2015, at 11:39 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, DeepLearning algorithms are popular and achieve many state of the art performance in several real world

How to access OpenHashSet in my standalone program?

2015-01-09 Thread Tae-Hyuk Ahn
Hi, I would like to use OpenHashSet (org.apache.spark.util.collection.OpenHashSet) in my standalone program. I can import it without error: import org.apache.spark.util.collection.OpenHashSet However, when I try to access it, I get an error: object OpenHashSet in package
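
OpenHashSet is declared private[spark], which is why it imports fine but fails on use. One common (unsupported and version-fragile) workaround, sketched here under that assumption, is to put your own code in an org.apache.spark sub-package so the access modifier is satisfied:

  package org.apache.spark.util // piggybacks on the private[spark] scope

  import org.apache.spark.util.collection.OpenHashSet

  object OpenHashSetDemo { // illustrative name
    def countDistinct(xs: Iterator[Long]): Int = {
      val set = new OpenHashSet[Long]()
      xs.foreach(set.add)
      set.size
    }
  }

Since the class is an internal API, it can change or disappear between Spark releases.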

RE: Cleaning up spark.local.dir automatically

2015-01-09 Thread michael.england
Thanks, I imagine this will kill any cached RDDs if their files are beyond the ttl? Thanks From: Raghavendra Pandey [mailto:raghavendra.pan...@gmail.com] Sent: 09 January 2015 15:29 To: England, Michael (IT/UK); user@spark.apache.org Subject: Re: Cleaning up spark.local.dir automatically You

Re: Cleaning up spark.local.dir automatically

2015-01-09 Thread Raghavendra Pandey
You may like to look at the spark.cleaner.ttl configuration, which is infinite by default. Spark has that configuration to delete temp files from time to time. On Fri Jan 09 2015 at 8:34:10 PM michael.engl...@nomura.com wrote: Hi, Is there a way of automatically cleaning up the spark.local.dir after
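
A minimal sketch of that setting (the value is illustrative; spark.cleaner.ttl is specified in seconds):

  val conf = new org.apache.spark.SparkConf()
    .set("spark.cleaner.ttl", "86400") // clean metadata/temp data older than a day
  val sc = new org.apache.spark.SparkContext(conf)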

Re: Set EXTRA_JAR environment variable for spark-jobserver

2015-01-09 Thread Sasi
Boris, Yes, as you mentioned, we are creating a new SparkContext for our job, the reason being to define the Apache Cassandra connection using SparkConf. We hope this should also work. For uploading the JAR, we followed (1) Package the JAR using the *sbt package* command (2) Use *curl --data-binary

RE: Spark History Server can't read event logs

2015-01-09 Thread michael.england
Hi Marcelo, On MapR, the mapr user can read the files using the NFS mount, however using the normal hadoop fs -cat /... command, I get permission denied. As the history server is pointing to a location on mapfs, not the NFS mount, I'd imagine the Spark history server is trying to read the

Re: Did anyone tried overcommit of CPU cores?

2015-01-09 Thread Xuelin Cao
Thanks, but how do I increase the number of tasks per core? For example, if the application claims 10 cores, is it possible to launch 100 tasks concurrently? On Fri, Jan 9, 2015 at 2:57 PM, Jörn Franke jornfra...@gmail.com wrote: Hallo, Based on experiences with other software in virtualized

Re: SPARKonYARN failing on CDH 5.3.0 : container cannot be fetched because of NumberFormatException

2015-01-09 Thread Mukesh Jha
I am using the pre-built *spark-1.2.0-bin-hadoop2.4* from *[1]* to submit spark applications to YARN; I cannot find a pre-built Spark for *CDH-5.x* versions. So, in my case the org.apache.hadoop.yarn.util.ConverterUtils class is coming from the spark-assembly-1.1.0-hadoop2.4.0.jar which is part of

EventBatch and SparkFlumeProtocol not found in spark codebase?

2015-01-09 Thread bit1...@163.com
Hi, When I fetch the Spark code base, import it into IntelliJ IDEA as an SBT project, and then build it with SBT, there are compile errors in the examples module complaining that EventBatch and SparkFlumeProtocol cannot be found; it looks like they should be in the org.apache.spark.streaming.flume.sink package. Not

Is It Feasible for Spark 1.1 Broadcast to Fully Utilize the Ethernet Card Throughput?

2015-01-09 Thread Jun Yang
Guys, I have a question regarding the Spark 1.1 broadcast implementation. In our pipeline, we have a large multi-class LR model, which is about 1 GiB in size. To exploit Spark parallelism, the natural idea is to broadcast this model file to the worker nodes. However, it looks like

Re: skipping header from each file

2015-01-09 Thread Sean Owen
I think this was already answered on stackoverflow: http://stackoverflow.com/questions/27854919/skipping-header-file-from-each-csv-file-in-spark where the one additional idea would be: If there were just one header line, in the first record, then the most efficient way to filter it out is:
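
The idea from that answer, sketched (hedged; the path is illustrative): drop the first line of the first partition only, which is where a single leading header lands:

  val noHeader = sc.textFile("hdfs:///data/input.csv").mapPartitionsWithIndex {
    (i, iter) => if (i == 0) iter.drop(1) else iter
  }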

RE: skipping header from each file

2015-01-09 Thread Somnath Pandeya
Maybe you can use the wholeTextFiles method, which returns the filename and content of each file as a PairRDD, and then you can remove the first line from each file. -Original Message- From: Hafiz Mujadid [mailto:hafizmujadi...@gmail.com] Sent: Friday, January 09, 2015 11:48 AM To:
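
A sketch of that variant (hedged; path illustrative). Note that wholeTextFiles materializes each file as one string, so this is only practical when individual files fit in memory:

  val noHeaders = sc.wholeTextFiles("hdfs:///data/*.csv")
    .flatMap { case (_, content) => content.split("\n").drop(1) } // drop each file's header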

Re: java.io.IOException: Mkdirs failed to create file:/some/path/myapp.csv while using rdd.saveAsTextFile(fileAddress) Spark

2015-01-09 Thread firemonk9
I am facing the same exception in saveAsObjectFile. Have you found any solution?

Re: TF-IDF from spark-1.1.0 not working on cluster mode

2015-01-09 Thread Xiangrui Meng
This is worker log, not executor log. The executor log can be found in folders like /newdisk2/rta/rtauser/workerdir/app-20150109182514-0001/0/ . -Xiangrui On Fri, Jan 9, 2015 at 5:03 AM, Priya Ch learnings.chitt...@gmail.com wrote: Please find the attached worker log. I could see stream closed

Re: Is It Feasible for Spark 1.1 Broadcast to Fully Utilize the Ethernet Card Throughput?

2015-01-09 Thread Davies Liu
In the current implementation of TorrentBroadcast, the blocks are fetched one by one in a single thread, so it cannot fully utilize the network bandwidth. Davies On Fri, Jan 9, 2015 at 2:11 AM, Jun Yang yangjun...@gmail.com wrote: Guys, I have a question regarding to Spark 1.1 broadcast

Re: DeepLearning and Spark ?

2015-01-09 Thread Peng Cheng
You are not the first :) and probably not even the fifth to have this question. A parameter server is not included in the Spark framework and I've seen all kinds of hacks to improvise one: REST APIs, HDFS, Tachyon, etc. Not sure if an 'official' benchmark implementation will be released soon On 9 January 2015

RE: RowMatrix.multiply() ?

2015-01-09 Thread Adrian Mocanu
I’m resurrecting this thread because I’m interested in doing transpose on a RowMatrix. There is this other thread too: http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-multiplication-in-spark-td12562.html Which presents https://issues.apache.org/jira/browse/SPARK-3434 which is still

Re: Parquet compression codecs not applied

2015-01-09 Thread Michael Armbrust
This is a little confusing, but that code path is actually going through Hive, so the Spark SQL configuration does not help. Perhaps try: set parquet.compression=GZIP; On Fri, Jan 9, 2015 at 2:41 AM, Ayoub benali.ayoub.i...@gmail.com wrote: Hello, I tried to save a table created via the
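
That is, issue the Hive-side property through the HiveContext rather than the Spark SQL setting (a sketch, assuming sqlContext is a HiveContext):

  sqlContext.sql("SET parquet.compression=GZIP")
  // subsequent writes that go through the Hive path should pick this codec up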

Re: Spark SQL: Storing AVRO Schema in Parquet

2015-01-09 Thread Jerry Lam
Hi Raghavendra, This makes a lot of sense. Thank you. The problem is that I'm using Spark SQL right now to generate the parquet file. What I think I need to do is to use Spark directly, transform all rows of the SchemaRDD to Avro objects, and supply them to saveAsNewAPIHadoopFile (from the

Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-09 Thread Michael Armbrust
The other thing to note here is that Spark SQL defensively copies rows when we switch into user code. This probably explains the difference between 1 and 2. The difference between 1 and 3 is likely the cost of decompressing the column buffers vs. accessing a bunch of uncompressed primitive objects.

Re: Failed to save RDD as text file to local file system

2015-01-09 Thread firemonk9
Have you found any resolution for this issue?

RE: Failed to save RDD as text file to local file system

2015-01-09 Thread NingjunWang
No, do you have any idea? Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541 From: firemonk9 [via Apache Spark User List] [mailto:ml-node+s1001560n21067...@n3.nabble.com] Sent: Friday, January 09, 2015 2:56 PM To: Wang, Ningjun

Parquet predicate pushdown troubles

2015-01-09 Thread Yana Kadiyska
I am running the following (connecting to an external Hive Metastore) /a/shark/spark/bin/spark-shell --master spark://ip:7077 --conf *spark.sql.parquet.filterPushdown=true* val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) and then ran two queries: sqlContext.sql(select count(*)

updateStateByKey: cleaning up state for keys not in current window

2015-01-09 Thread Simone Franzini
I know that in order to clean up the state for a key I have to return None when I call updateStateByKey. However, as far as I understand, updateStateByKey only gets called for new keys (i.e. keys in the current batch), not for all keys in the DStream. So, how can I clear the state for those keys in
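
For what it's worth, in Spark 1.x updateStateByKey does invoke the update function for every key already in state (with an empty Seq of new values when the key is absent from the current batch), so returning None there removes the key. A hedged sketch that expires idle keys via a last-seen timestamp (names and TTL illustrative):

  val ttlMs = 60 * 60 * 1000L // illustrative: expire keys idle for an hour

  def update(values: Seq[Long], state: Option[(Long, Long)]): Option[(Long, Long)] = {
    val now = System.currentTimeMillis()
    if (values.nonEmpty)
      Some((state.map(_._1).getOrElse(0L) + values.sum, now))       // (count, lastSeen)
    else
      state.filter { case (_, lastSeen) => now - lastSeen < ttlMs } // None drops the key
  }

  // pairs: DStream[(String, Long)] -- illustrative
  val counts = pairs.updateStateByKey(update _)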

Re: RowMatrix.multiply() ?

2015-01-09 Thread Reza Zadeh
Yes that is the correct JIRA. It should make it to 1.3. Best, Reza On Fri, Jan 9, 2015 at 11:13 AM, Adrian Mocanu amoc...@verticalscope.com wrote: I’m resurrecting this thread because I’m interested in doing transpose on a RowMatrix. There is this other thread too:

Questions about Spark and HDFS co-location

2015-01-09 Thread zfry
I am running Spark 1.1.1 built against CDH4 and have a few questions regarding Spark performance related to co-location with HDFS nodes. I want to know whether (and how efficiently) Spark takes advantage of being co-located with an HDFS node. What I mean by this is: if a file is being read by

Re: Reading HBase data - Exception

2015-01-09 Thread lmsiva
This problem got resolved. In spark-submit I was using only the --driver-class-path option; once I added the --jars option so that the workers are aware of the dependent jar files, the problem went away. Need to check if there are any worker logs that give better information than the exception I was

Re: Cleaning up spark.local.dir automatically

2015-01-09 Thread Andrew Ash
That's a worker setting which cleans up the files left behind by executors, so spark.cleaner.ttl isn't at the RDD level. After https://issues.apache.org/jira/browse/SPARK-1860 the cleaner won't clean up directories left by running executors. On Fri, Jan 9, 2015 at 7:38 AM,

Re: DeepLearning and Spark ?

2015-01-09 Thread Andrei
Does it make sense to use Spark's actor system (e.g. via SparkContext.env.actorSystem) to create a parameter server? On Fri, Jan 9, 2015 at 10:09 PM, Peng Cheng rhw...@gmail.com wrote: You are not the first :) probably not the fifth to have the question. parameter server is not included in

Re: DeepLearning and Spark ?

2015-01-09 Thread Krishna Sankar
I am also looking at this domain. We could potentially use the broadcast capability in Spark to distribute the parameters. Haven't thought thru yet. Cheers k/ On Fri, Jan 9, 2015 at 2:56 PM, Andrei faithlessfri...@gmail.com wrote: Does it makes sense to use Spark's actor system (e.g. via

Re: RDD Moving Average

2015-01-09 Thread Mohit Jaggi
Read this: http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3ccalrvtpkn65rolzbetc+ddk4o+yjm+tfaf5dz8eucpl-2yhy...@mail.gmail.com%3E

Streaming Checkpointing

2015-01-09 Thread Asim Jalis
In Spark Streaming apps if I enable ssc.checkpoint(dir) does this checkpoint all RDDs? Or is it just checkpointing windowing and state RDDs? For example, if in a DStream I am using an iterative algorithm on a non-state non-window RDD, do I have to checkpoint it explicitly myself, or can I assume

Re: EventBatch and SparkFlumeProtocol not found in spark codebase?

2015-01-09 Thread Sean Owen
What's up with the IntelliJ questions all of a sudden? This PR from yesterday contains a summary of the answer to your question: https://github.com/apache/spark/pull/3952 : Rebuild Project can fail the first time the project is compiled, because the generated source files are not created automatically.

Re: Is It Feasible for Spark 1.1 Broadcast to Fully Utilize the Ethernet Card Throughput?

2015-01-09 Thread Akhil Das
You can try the following: - Increase spark.akka.frameSize (default is 10MB) - Try using torrentBroadcast Thanks Best Regards On Fri, Jan 9, 2015 at 3:41 PM, Jun Yang yangjun...@gmail.com wrote: Guys, I have a question regarding to Spark 1.1 broadcast implementation. In our pipeline, we
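
Both knobs, sketched (hedged; values illustrative; in Spark 1.1 the torrent implementation is selected via spark.broadcast.factory):

  val conf = new org.apache.spark.SparkConf()
    .set("spark.akka.frameSize", "50") // in MB; default is 10
    .set("spark.broadcast.factory", "org.apache.spark.broadcast.TorrentBroadcastFactory")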

KryoDeserialization getting java.io.EOFException

2015-01-09 Thread yuemeng1
Hi, when I run a query in Spark SQL, it gives me the following error. What possible reasons could cause this problem? java.io.EOFException at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:148) at

OptionalDataException during Naive Bayes Training

2015-01-09 Thread jatinpreet
Hi, I am using Spark version 1.1 in standalone mode in the cluster. Sometimes, during Naive Bayes training, I get an OptionalDataException at the line, map at NaiveBayes.scala:109 I am getting the following exception on the console, java.io.OptionalDataException:

Re: Queue independent jobs

2015-01-09 Thread Sean Owen
You can parallelize on the driver side. The way to do it is almost exactly what you have here, where you're iterating over a local Scala collection of dates and invoking a Spark operation for each. Simply write dateList.par.map(...) to make the local map proceed in parallel. It should invoke the
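
A sketch of that suggestion (hedged; sc.hdfsFile and transform are the original poster's illustrative names, rendered here with textFile):

  dateList.par.map { date =>                 // Scala parallel collection on the driver
    sc.textFile(s"hdfs:///in/$date")         // paths illustrative
      .map(transform)
      .saveAsTextFile(s"hdfs:///out/$date")
  }

If the concurrently submitted jobs queue up behind each other, setting spark.scheduler.mode=FAIR may help them share executors more evenly.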

Re: SPARKonYARN failing on CDH 5.3.0 : container cannot be fetched because of NumberFormatException

2015-01-09 Thread Sean Owen
Again this is probably not the place for CDH-specific questions, and this one is already answered at http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/CDH-5-3-0-container-cannot-be-fetched-because-of/m-p/23497#M478 On Fri, Jan 9, 2015 at 9:23 AM, Mukesh Jha me.mukesh@gmail.com

Re: Parallel execution on one node

2015-01-09 Thread Sean Owen
The repartitioning has some overhead. Do your results show that perhaps this is what is taking the extra time? You can try reading the source file with more partitions instead of partitioning it after the fact. See the minPartitions argument to loadLibSVMFile. But before that, I'd assess whether
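
For reference, a hedged sketch of the minPartitions overload mentioned above (the feature and partition counts are illustrative):

  import org.apache.spark.mllib.util.MLUtils

  // 1000 features, at least 32 input partitions -- both illustrative
  val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data.libsvm", 1000, 32)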

Re: PermGen issues on AWS

2015-01-09 Thread Joe Wass
Thanks, I noticed this after posting. I'll try that. I also think that perhaps Clojure might be creating more classes than the equivalent Java would, so I'll nudge it a bit higher. On 9 January 2015 at 11:45, Sean Owen so...@cloudera.com wrote: It's normal for PermGen to be a bit more of an

Re: Mapping directory structure to columns in SparkSQL

2015-01-09 Thread Michael Davies
Hi Michael, I have got the directory-based column support working, at least in a trial. I have put the trial code here - DirIndexParquet.scala https://github.com/MickDavies/spark-parquet-dirindex/blob/master/src/main/scala/org/apache/spark/sql/parquet/DirIndexParquet.scala it has involved me

Play Scala Spark Example

2015-01-09 Thread Eduardo Cusa
Hi guys, I'm running the following example: https://github.com/knoldus/Play-Spark-Scala on the same machine as the Spark master, and the Spark cluster was launched with the ec2 script. I'm stuck with these errors; any idea how to fix them? Regards, Eduardo Calling the Play app prints the following

Newbie Question on How Tasks are Executed

2015-01-09 Thread mixtou
I am writing my first project in Spark. It is an implementation of the Space Saving counting algorithm. I am trying to understand how tasks are executed in partitions. As you can see from my code, the algorithm keeps in memory only a small number of words, for example 100: the top-k ones, not all

Accidental kill in UI

2015-01-09 Thread Joe Wass
So I had a Spark job with various failures, and I decided to kill it and start again. I clicked the 'kill' link in the web console, restarted the job on the command line and headed back to the web console and refreshed to see how my job was doing... the URL at the time was:

Re: Accidental kill in UI

2015-01-09 Thread Sean Owen
(FWIW yes I think this should certainly be a POST. The link can become a miniature form to achieve this and then the endpoint just needs to accept POST only. You should propose a pull request.) On Fri, Jan 9, 2015 at 12:51 PM, Joe Wass jw...@crossref.org wrote: So I had a Spark job with various

Re: TF-IDF from spark-1.1.0 not working on cluster mode

2015-01-09 Thread Priya Ch
Please find the attached worker log. I could see stream closed exception On Wed, Jan 7, 2015 at 10:51 AM, Xiangrui Meng men...@gmail.com wrote: Could you attach the executor log? That may help identify the root cause. -Xiangrui On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch

Parquet compression codecs not applied

2015-01-09 Thread Ayoub
Hello, I tried to save a table created via the hive context as a parquet file but whatever compression codec (uncompressed, snappy, gzip or lzo) I set via setConf like: setConf("spark.sql.parquet.compression.codec", "gzip") the size of the generated files is always the same, so it seems like

Re: Set EXTRA_JAR environment variable for spark-jobserver

2015-01-09 Thread Sasi
We are able to resolve *SparkException: Job aborted due to stage failure: All masters are unresponsive! Giving up* as well. Spark-jobserver is working fine now and we need to experiment more. Thank you guys.

Re: Did anyone tried overcommit of CPU cores?

2015-01-09 Thread gen tang
Hi, As you said, the --executor-cores option will define the max number of tasks that an executor can take simultaneously. So, if you claim 10 cores, it is not possible to launch more than 10 tasks in an executor at the same time. In my experience, setting more cores than physical CPU cores will

Zipping RDDs of equal size not possible

2015-01-09 Thread Niklas Wilcke
Hi Spark community, I have a problem with zipping two RDDs of the same size and same number of partitions. The error message says that zipping is only allowed on RDDs which are partitioned into chunks of exactly the same size. How can I ensure this? My workaround at the moment is to repartition

Re: Queue independent jobs

2015-01-09 Thread Anders Arpteg
Awesome, it actually seems to work. Amazing how simple it can be sometimes... Thanks Sean! On Fri, Jan 9, 2015 at 12:42 PM, Sean Owen so...@cloudera.com wrote: You can parallelize on the driver side. The way to do it is almost exactly what you have here, where you're iterating over a local

Re: PermGen issues on AWS

2015-01-09 Thread Sean Owen
It's normal for PermGen to be a bit more of an issue with Spark than for other JVM-based applications. You should simply increase the PermGen size, which I don't see in your command. -XX:MaxPermSize=256m allows it to grow to 256m for example. The right size depends on your total heap size and app.
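
A hedged example of passing the flag through (sizes illustrative). Executor options can be set in the conf; the driver's own JVM options have to be supplied at launch:

  val conf = new org.apache.spark.SparkConf()
    .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=256m") // executors
  // for the driver JVM, pass the flag at launch instead, e.g.:
  //   spark-submit --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=256m" ...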

Re: calculating the mean of SparseVector RDD

2015-01-09 Thread Rok Roskar
thanks for the suggestion -- however, it looks like this is even slower. With the small data set I'm using, my aggregate function takes ~9 seconds and colStats.mean() takes ~1 minute. However, I can't get it to run with the Kryo serializer -- I get the error:

PermGen issues on AWS

2015-01-09 Thread Joe Wass
I'm running on an AWS cluster of 10 x m1.large (64 bit, 7.5 GiB RAM). FWIW I'm using the Flambo Clojure wrapper which uses the Java API but I don't think that should make any difference. I'm running with the following command: spark/bin/spark-submit --class mything.core --name My Thing --conf

Queue independent jobs

2015-01-09 Thread Anders Arpteg
Hey, Let's say we have multiple independent jobs that each transform some data and store it in distinct HDFS locations; is there a nice way to run them in parallel? See the following pseudo-code snippet: dateList.map(date => sc.hdfsFile(date).map(transform).saveAsHadoopFile(date)) It's unfortunate

Re: Data locality running Spark on Mesos

2015-01-09 Thread Michael V Le
Hi Tim, Thanks for your response. The benchmark I used just reads data in from HDFS and builds the Linear Regression model using methods from the MLlib. Unfortunately, for various reasons, I can't open the source code for the benchmark at this time. I will try to replicate the problem using

Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-09 Thread Cheng Lian
Hey Nathan, Thanks for sharing, this is a very interesting post :) My comments are inlined below. Cheng On 1/7/15 11:53 AM, Nathan McCarthy wrote: Hi, I’m trying to use a combination of SparkSQL and ‘normal' Spark/Scala via rdd.mapPartitions(…). Using the latest release 1.2.0. Simple

Cleaning up spark.local.dir automatically

2015-01-09 Thread michael.england
Hi, Is there a way of automatically cleaning up the spark.local.dir after a job has been run? I have noticed a large number of temporary files have been stored here and are not cleaned up. The only solution I can think of is to run some sort of cron job to delete files older than a few days. I