Re: How to set UI port #?

2015-01-12 Thread Prannoy
Set the port using spconf.set("spark.ui.port", &lt;port&gt;), where &lt;port&gt; is any free port and spconf is your Spark configuration object. On Sun, Jan 11, 2015 at 2:08 PM, YaoPau [via Apache Spark User List] ml-node+s1001560n21083...@n3.nabble.com wrote: I have multiple Spark Streaming jobs running all day, and
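A minimal sketch of the suggestion above, assuming a Scala driver; the application name and port number are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Pick the UI port before the SparkContext is created; 4050 is an arbitrary free port.
val spconf = new SparkConf()
  .setAppName("my-streaming-job")   // placeholder app name
  .set("spark.ui.port", "4050")
val sc = new SparkContext(spconf)
```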

Re: Problem with building spark-1.2.0

2015-01-12 Thread Kartheek.R
Hi, This is what I am trying to do: karthik@s4:~/spark-1.2.0$ SPARK_HADOOP_VERSION=2.3.0 sbt/sbt clean Using /usr/lib/jvm/java-7-oracle as default JAVA_HOME. Note, this will be overridden by -java-home if it is set. [info] Loading project definition from /home/karthik/spark-1.2.0/project/project

GraphX vs GraphLab

2015-01-12 Thread Madabhattula Rajesh Kumar
Hi Team, Has anyone done a comparison (pros and cons) study between GraphX and GraphLab? Could you please let me know any links for this comparison. Regards, Rajesh

Re: Failed to save RDD as text file to local file system

2015-01-12 Thread Prannoy
Have you tried simply giving the path where you want to save the file? For instance, in your case just do r.saveAsTextFile("home/cloudera/tmp/out1"). Don't use file:. This will create a folder with the name out1. saveAsTextFile always writes by making a directory; it does not write data into a

how to select the first row in each group by group?

2015-01-12 Thread LinQili
Hi all: I am using Spark SQL to read and write Hive tables. But there is an issue: how to select the first row in each group? In Hive, we could write HQL like this: SELECT imei FROM (SELECT imei, row_number() OVER (PARTITION BY imei ORDER BY login_time
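The quoted HQL is truncated above. As a hedged alternative (not from this thread), the same "one row per key" result can be computed with plain RDD operations; the Login case class, the sample data, and the ordering direction below are assumptions for illustration:

```scala
case class Login(imei: String, loginTime: Long)

val rows = sc.parallelize(Seq(Login("a", 2L), Login("a", 1L), Login("b", 5L)))

// Keep the row with the smallest login_time per imei (the "first" row of each group).
val firstPerGroup = rows
  .map(r => (r.imei, r))
  .reduceByKey((x, y) => if (x.loginTime <= y.loginTime) x else y)
  .values
```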

Manually trigger RDD map function without action

2015-01-12 Thread Kevin Jung
Hi all, Is there an efficient way to trigger RDD transformations? I'm now using the count action to achieve this. Best regards Kevin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Manually-trigger-RDD-map-function-without-action-tp21094.html Sent from the Apache

Re: creating a single kafka producer object for all partitions

2015-01-12 Thread Hafiz Mujadid
Actually this code is producing a "leader not found" exception. I am unable to find the reason. On Mon, Jan 12, 2015 at 4:03 PM, kevinkim [via Apache Spark User List] ml-node+s1001560n21098...@n3.nabble.com wrote: Well, you can use coalesce() to decrease number of partition to 1. (It will

Re: Does DecisionTree model in MLlib deal with missing values?

2015-01-12 Thread Sean Owen
On Sun, Jan 11, 2015 at 9:46 PM, Christopher Thom christopher.t...@quantium.com.au wrote: Is there any plan to extend the data types that would be accepted by the Tree models in Spark? e.g. Many models that we build contain a large number of string-based categorical factors. Currently the

RowMatrix multiplication

2015-01-12 Thread Alex Minnaar
I have a rowMatrix on which I want to perform two multiplications. The first is a right multiplication with a local matrix which is fine. But after that I also wish to right multiply the transpose of my rowMatrix with a different local matrix. I understand that there is no functionality to

Re: Manually trigger RDD map function without action

2015-01-12 Thread Cody Koeninger
If you don't care about the value that your map produced (because you're not already collecting or saving it), then is foreach more appropriate to what you're doing? On Mon, Jan 12, 2015 at 4:08 AM, kevinkim kevin...@apache.org wrote: Hi, answer from another Kevin. I think you may already

Re: creating a single kafka producer object for all partitions

2015-01-12 Thread Cody Koeninger
You should take a look at https://issues.apache.org/jira/browse/SPARK-4122 which is implementing writing to kafka in a pretty similar way (make a new producer inside foreachPartition) On Mon, Jan 12, 2015 at 5:24 AM, Sean Owen so...@cloudera.com wrote: Leader-not-found suggests a problem with

Re: Removing JARs from spark-jobserver

2015-01-12 Thread Fernando O.
just an FYI: you can configure that using spark.jobserver.filedao.rootdir On Mon, Jan 12, 2015 at 1:52 AM, abhishek reachabhishe...@gmail.com wrote: Nice! Good to know On 11 Jan 2015 21:10, Sasi [via Apache Spark User List] [hidden email]

creating a single kafka producer object for all partitions

2015-01-12 Thread Hafiz Mujadid
Hi experts! I have a SchemaRDD of messages to be pushed into Kafka, so I am using the following piece of code to do that: rdd.foreachPartition(itr => { val props = new Properties(); props.put("metadata.broker.list", brokersList)
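A sketch of how that snippet might continue, using the Kafka 0.8 producer API current at the time; brokersList, topic, and the string payload are assumptions, not code from the thread:

```scala
import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

rdd.foreachPartition { itr =>
  val props = new Properties()
  props.put("metadata.broker.list", brokersList)                  // e.g. "host1:9092,host2:9092"
  props.put("serializer.class", "kafka.serializer.StringEncoder")
  val producer = new Producer[String, String](new ProducerConfig(props))  // one producer per partition
  itr.foreach(msg => producer.send(new KeyedMessage[String, String](topic, msg.toString)))
  producer.close()
}
```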

Re: Play Scala Spark Exmaple

2015-01-12 Thread Eduardo Cusa
The EC2 version is 1.1.0 and this is my build.sbt: libraryDependencies ++= Seq( jdbc, anorm, cache, "org.apache.spark" %% "spark-core" % "1.1.0", "com.typesafe.akka" %% "akka-actor" % "2.2.3", "com.typesafe.akka" %% "akka-slf4j" % "2.2.3", "org.apache.spark" %%

Re: calculating the mean of SparseVector RDD

2015-01-12 Thread Rok Roskar
This was without using Kryo -- if I use kryo, I got errors about buffer overflows (see above): com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 5, required: 8 Just calling colStats doesn't actually compute those statistics, does it? It looks like the computation is only

Re: Problem with building spark-1.2.0

2015-01-12 Thread Sean Owen
The problem is there in the logs. When it went to clone some code, something went wrong with the proxy: Received HTTP code 407 from proxy after CONNECT Probably you have an HTTP proxy and you have not authenticated. It's specific to your environment. Although it's unrelated, I'm curious how

Re: Spark SQL Parquet - data are reading very very slow

2015-01-12 Thread kaushal
Yes, I am also facing the same problem. Can anyone please help to get faster execution? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Parquet-data-are-reading-very-very-slow-tp21061p21100.html Sent from the Apache Spark User List mailing list

Re: creating a single kafka producer object for all partitions

2015-01-12 Thread Sean Owen
Leader-not-found suggests a problem with zookeeper config. It depends a lot on the specifics of your error. But this is really a Kafka question, better for the Kafka list. On Mon, Jan 12, 2015 at 11:10 AM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: Actually this code is producing error leader

Re: Manually trigger RDD map function without action

2015-01-12 Thread kevinkim
Hi, answer from another Kevin. I think you may already know it, 'transformation' in spark (http://spark.apache.org/docs/latest/programming-guide.html#transformations) will be done in 'lazy' way, when you trigger 'actions'. (http://spark.apache.org/docs/latest/programming-guide.html#actions) So

Re: Issue with Parquet on Spark 1.2 and Amazon EMR

2015-01-12 Thread Aniket Bhatnagar
Meanwhile, I have submitted a pull request ( https://github.com/awslabs/emr-bootstrap-actions/pull/37) that allows users to place their jars ahead of all other jars in spark classpath. This should serve as a temporary workaround for all class conflicts. Thanks, Aniket On Mon Jan 05 2015 at

Spark does not loop through a RDD.map

2015-01-12 Thread rkgurram
Hi, I am observing some weird behavior with Spark. It might be my misinterpretation of some fundamental concepts, but I have looked at it for 3 days and have not been able to solve it. The source code is pretty long and complex so instead of posting it, I will try to articulate the problem. I am

Re: Failed to save RDD as text file to local file system

2015-01-12 Thread Sean Owen
I think you're confusing HDFS paths and local paths. You are cd'ing to a directory and seem to want to write output there, but your path has no scheme and defaults to being an HDFS path. When you use file: you seem to have a permission error (perhaps). On Mon, Jan 12, 2015 at 4:21 PM, NingjunWang
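A small sketch of the distinction Sean is describing; the paths are placeholders:

```scala
val r = sc.parallelize(Array("a", "b", "c"))

// No scheme: resolved against fs.defaultFS, which on a Hadoop cluster is usually HDFS.
r.saveAsTextFile("/home/cloudera/tmp/out1")

// Explicit file: scheme: written to the local filesystem of whichever node runs each task.
r.saveAsTextFile("file:///home/cloudera/tmp/out2")
```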

Re: How to use memcached with spark

2015-01-12 Thread octavian.ganea
I am trying to use it, but without success. Any sample code that works with Spark would be highly appreciated. :) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-memcached-with-spark-tp13409p21103.html Sent from the Apache Spark User List mailing

Spark executors resources. Blocking?

2015-01-12 Thread Luis Guerra
Hello all, I have a naive question regarding how spark uses the executors in a cluster of machines. Imagine the scenario in which I do not know the input size of my data in execution A, so I set Spark to use 20 (out of my 25 nodes, for instance). At the same time, I also launch a second execution

Re: configuring spark.yarn.driver.memoryOverhead on Spark 1.2.0

2015-01-12 Thread David McWhorter
Hi Ganelin, sorry if it wasn't clear from my previous email, but that is how I am creating a spark context. I just didn't write out the lines where I create the new SparkConf and SparkContext. I am also upping the driver memory when running. Thanks, David On 01/12/2015 11:12 AM, Ganelin,

Re: Spark does not loop through a RDD.map

2015-01-12 Thread Cody Koeninger
At a quick glance, I think you're misunderstanding some basic features. http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations Map is a transformation, it is lazy. You're not calling any action on the result of map. Also, closing over a mutable variable (like idx or

Re: configuring spark.yarn.driver.memoryOverhead on Spark 1.2.0

2015-01-12 Thread Sean Owen
Isn't the syntax --conf property=value? http://spark.apache.org/docs/latest/configuration.html Yes, I think setting it after the driver is running is of course too late. On Mon, Jan 12, 2015 at 4:01 PM, David McWhorter mcwhor...@ccri.com wrote: Hi all, I'm trying to figure out how to set
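For reference, a hedged example of that --conf syntax at submit time (the class name, jar, and values are placeholders):

```bash
spark-submit \
  --master yarn-cluster \
  --conf spark.yarn.driver.memoryOverhead=1024 \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  --class com.example.MyApp my-app.jar
```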

Re: RowMatrix multiplication

2015-01-12 Thread Alex Minnaar
That's not quite what I'm looking for. Let me provide an example. I have a RowMatrix A that is n x m and two local matrices b and c; b is m x 1 and c is n x 1. In my Spark job I wish to perform the following two computations: A*b and A^T*c. I don't think this is possible without being

Re: Getting Output From a Cluster

2015-01-12 Thread Su She
Hello Everyone, Quick followup, is there any way I can append output to one file rather then create a new directory/file every X milliseconds? Thanks! Suhas Shekar University of California, Los Angeles B.A. Economics, Specialization in Computing 2014 On Thu, Jan 8, 2015 at 11:41 PM, Su She

Re: /tmp directory fills up

2015-01-12 Thread Marcelo Vanzin
Hi Alessandro, You can look for a log line like this in your driver's output: 15/01/12 10:51:01 INFO storage.DiskBlockManager: Created local directory at /data/yarn/nm/usercache/systest/appcache/application_1421081007635_0002/spark-local-20150112105101-4f3d If you're deploying your application

ReliableKafkaReceiver stopped receiving data after WriteAheadLogBasedBlockHandler throws TimeoutException

2015-01-12 Thread Max Xu
Hi all, I am running a Spark streaming application with ReliableKafkaReceiver (Spark 1.2.0). Constantly I was getting the following exception: 15/01/12 19:07:06 ERROR receiver.BlockGenerator: Error in block pushing thread java.util.concurrent.TimeoutException: Futures timed out after [30

Re: RowMatrix multiplication

2015-01-12 Thread Alex Minnaar
Good idea! Join each element of c with the corresponding row of A, multiply through, then reduce. I'll give this a try. Thanks, Alex From: Reza Zadeh r...@databricks.com Sent: Monday, January 12, 2015 3:05 PM To: Alex Minnaar Cc:
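A rough sketch of that idea, treating A^T*c as a weighted sum of A's rows. It assumes the rows of A are available with their index as an RDD[(Long, Vector)] and that c fits in a local Array[Double]; none of this is an existing RowMatrix method:

```scala
import org.apache.spark.mllib.linalg.Vector

val cB = sc.broadcast(c)                          // c: Array[Double] of length n
val aTc: Array[Double] = indexedRows              // indexedRows: RDD[(Long, Vector)], rows of A
  .map { case (i, row) => row.toArray.map(_ * cB.value(i.toInt)) }   // scale row i by c(i)
  .reduce((x, y) => x.zip(y).map { case (a, b) => a + b })           // sum scaled rows => A^T * c (length m)
```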

Re: How to recovery application running records when I restart Spark master?

2015-01-12 Thread ChongTang
Is there anybody who can help me with this? Thank you very much! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-recovery-application-running-records-when-I-restart-Spark-master-tp21088p21108.html Sent from the Apache Spark User List mailing list

no snappyjava in java.library.path

2015-01-12 Thread Dan Dong
Hi, My Spark job failed with "no snappyjava in java.library.path": Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1857) at java.lang.Runtime.loadLibrary0(Runtime.java:870) at

Re: no snappyjava in java.library.path

2015-01-12 Thread David Rosenstrauch
I ran into this recently. Turned out we had an old org-xerial-snappy.properties file in one of our conf directories that had the setting: # Disables loading Snappy-Java native library bundled in the # snappy-java-*.jar file forcing to load the Snappy-Java native # library from the

Re: Web Service + Spark

2015-01-12 Thread Robert C Senkbeil
If you would like to work with an API, you can use the Spark Kernel found here: https://github.com/ibm-et/spark-kernel The kernel provides an API following the IPython message protocol as well as a client library that can be used with Scala applications. The kernel can also be plugged into the

Spark Framework handling of Mesos master change

2015-01-12 Thread Ethan Wolf
We are running Spark and Spark Streaming on Mesos (with multiple masters for HA). At launch, our Spark jobs successfully look up the current Mesos master from zookeeper and spawn tasks. However, when the Mesos master changes while the spark job is executing, the spark driver seems to interact

How to recovery application running records when I restart Spark master?

2015-01-12 Thread Chong Tang
Hi all, For various reasons, I restarted the Spark master node. Before I restarted it, there were some application running records at the bottom of the master web page, but they are gone after restarting the master node. The records include application name, running time, status, and so on. I am sure

Re: Getting Output From a Cluster

2015-01-12 Thread Akhil Das
There is no direct way of doing that. If you need a single file for every batch duration, then you can repartition the data to 1 partition before saving. Another way would be to use Hadoop's copyMerge command/API (available from version 2.0). On 13 Jan 2015 01:08, Su She suhsheka...@gmail.com wrote: Hello
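A sketch of the repartition-to-1 approach for a streaming job; the DStream name and output prefix are assumptions:

```scala
lines.foreachRDD { (rdd, time) =>
  // Each batch still gets its own directory, but with a single part file inside it.
  rdd.repartition(1).saveAsTextFile(s"hdfs:///user/suhas/output/batch-${time.milliseconds}")
}
```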

Re: Getting Output From a Cluster

2015-01-12 Thread Su She
Okay, thanks Akhil! Suhas Shekar University of California, Los Angeles B.A. Economics, Specialization in Computing 2014 On Mon, Jan 12, 2015 at 1:24 PM, Akhil Das ak...@sigmoidanalytics.com wrote: There is no direct way of doing that. If you need a Single file for every batch duration, then

Re: How to recovery application running records when I restart Spark master?

2015-01-12 Thread Cody Koeninger
http://spark.apache.org/docs/latest/monitoring.html http://spark.apache.org/docs/latest/configuration.html#spark-ui spark.eventLog.enabled On Mon, Jan 12, 2015 at 3:00 PM, ChongTang ct...@virginia.edu wrote: Is there any body can help me with this? Thank you very much! -- View this
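The settings behind those links, shown as a SparkConf sketch (the log directory is a placeholder):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///user/spark/applicationHistory")
```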

Re: Spark Framework handling of Mesos master change

2015-01-12 Thread Tim Chen
Hi Ethan, How are you specifying the master to Spark? Being able to recover from master failover is already handled by the underlying Mesos scheduler, but you have to use zookeeper instead of directly passing in the master URIs. Tim On Mon, Jan 12, 2015 at 12:44 PM, Ethan Wolf

Re: How to recovery application running records when I restart Spark master?

2015-01-12 Thread Chong Tang
Thank you, Cody! Actually, I have enabled this option, and I save the logs into the Hadoop file system. The problem is, how can I get the duration of an application? The attached file is the log I copied from HDFS. On Mon, Jan 12, 2015 at 4:36 PM, Cody Koeninger c...@koeninger.org wrote:

Re: Discrepancy in PCA values

2015-01-12 Thread Xiangrui Meng
Could you compare V directly and tell us more about the difference you saw? The columns of V should be the same up to signs. For example, the first column of V could be either [0.8, -0.6, 0.0] or [-0.8, 0.6, 0.0]. -Xiangrui On Sat, Jan 10, 2015 at 8:08 PM, Upul Bandara upulband...@gmail.com

How does unmanaged memory work with the executor memory limits?

2015-01-12 Thread Michael Albert
Greetings! My executors apparently are being terminated because they are "running beyond physical memory limits" according to the yarn-hadoop-nodemanager logs on the worker nodes (/mnt/var/log/hadoop on AWS EMR). I'm setting the driver-memory to 8G. However, looking at stdout in userlogs, I can

Re: How does unmanaged memory work with the executor memory limits?

2015-01-12 Thread Marcelo Vanzin
Short answer: yes. Take a look at: http://spark.apache.org/docs/latest/running-on-yarn.html Look for memoryOverhead. On Mon, Jan 12, 2015 at 2:06 PM, Michael Albert m_albert...@yahoo.com.invalid wrote: Greetings! My executors apparently are being terminated because they are running beyond

Re: How to recovery application running records when I restart Spark master?

2015-01-12 Thread Cody Koeninger
Sorry, slightly misunderstood the question. I'm not sure if there's a way to make the master UI read old log files after a restart, but the log files themselves are human readable text. If you just want application duration, the start and stop are timestamped, look for lines like this in

including the spark-mllib in build.sbt

2015-01-12 Thread Jianguo Li
Hi, I am trying to build my own Scala project using sbt. The project depends on both spark-core and spark-mllib. I included the following two dependencies in my build.sbt file: libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.1.1" libraryDependencies += "org.apache.spark" %%

How Spark Calculate partition size automatically

2015-01-12 Thread rajnish
Hi, When I am running a job that loads data from Cassandra, Spark has created almost 9 million partitions. How does Spark decide the partition count? I have read from one of the presentations that it is good to have 1,000 to 10,000 partitions. Regards Raj -- View this message in context:

Re: calculating the mean of SparseVector RDD

2015-01-12 Thread Xiangrui Meng
No, colStats() computes all summary statistics in one pass and stores the values. It is not lazy. On Mon, Jan 12, 2015 at 4:42 AM, Rok Roskar rokros...@gmail.com wrote: This was without using Kryo -- if I use kryo, I got errors about buffer overflows (see above):
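A small usage sketch of colStats(); the input vectors are made up:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val vectors = sc.parallelize(Seq(
  Vectors.sparse(3, Seq((0, 1.0))),
  Vectors.sparse(3, Seq((1, 2.0), (2, 4.0)))
))
val summary = Statistics.colStats(vectors)   // computed eagerly, in one pass
println(summary.mean)                        // column-wise means
```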

Re: including the spark-mllib in build.sbt

2015-01-12 Thread Xiangrui Meng
I don't know the root cause. Could you try including only libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.1.1"? It should be sufficient because mllib depends on core. -Xiangrui On Mon, Jan 12, 2015 at 2:27 PM, Jianguo Li flyingfromch...@gmail.com wrote: Hi, I am trying to build my

Re: OOM exception during row deserialization

2015-01-12 Thread Pala M Muthaia
Does anybody have insight on this? Thanks. On Fri, Jan 9, 2015 at 6:30 PM, Pala M Muthaia mchett...@rocketfuelinc.com wrote: Hi, I am using Spark 1.0.1. I am trying to debug an OOM exception I saw during a join step. Basically, I have an RDD of rows that I am joining with another RDD of

Re: Broadcast joins on RDD

2015-01-12 Thread Reza Zadeh
First, you should collect().toMap() the small RDD, then you should use broadcast followed by a map to do a map-side join http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf (slide 10 has an example). Spark SQL also does it by default for tables
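A minimal sketch of that map-side join; the RDD names and key/value types are assumptions:

```scala
val smallLookup: Map[Int, String] = smallRdd.collect().toMap   // smallRdd: RDD[(Int, String)]
val smallBc = sc.broadcast(smallLookup)

// Inner join done locally in each task: keys missing from the small side are dropped.
val joined = largeRdd.flatMap { case (k, v) =>                 // largeRdd: RDD[(Int, Long)]
  smallBc.value.get(k).map(s => (k, (v, s)))
}
```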

Re: Manually trigger RDD map function without action

2015-01-12 Thread Sven Krasser
Hey Kevin, I assume you want to trigger the map() for a side effect (since you don't care about the result). To Cody's point, you can use foreach() *instead* of map(). So instead of e.g. x.map(a => foo(a)).foreach(a => a), you'd run x.foreach(a => foo(a)). Best, -Sven On Mon, Jan 12, 2015 at 5:13

quickly counting the number of rows in a partition?

2015-01-12 Thread Kevin Burton
Is there a way to compute the total number of records in each RDD partition? So say I had 4 partitions, I’d want to have: partition 0: 100 records, partition 1: 104 records, partition 2: 90 records, partition 3: 140 records. Kevin -- Founder/CEO Spinn3r.com Location: *San Francisco, CA* blog:

Re: Is there any Spark implementation for Item-based Collaborative Filtering?

2015-01-12 Thread Simon Chan
Also a ready-to-use server with Spark MLlib: http://docs.prediction.io/recommendation/quickstart/ The source code is here: https://github.com/PredictionIO/PredictionIO/tree/develop/templates/scala-parallel-recommendation Simon On Sun, Nov 30, 2014 at 12:17 PM, Pat Ferrel p...@occamsmachete.com

Re: [mllib] GradientDescent requires huge memory for storing weight vector

2015-01-12 Thread Reza Zadeh
I guess you're not using too many features (e.g. 10m), just that hashing the index makes it look that way, is that correct? If so, the simple dictionary that maps your feature index -> rank can be broadcast and used everywhere, so you can pass MLlib just the feature's rank as its index. Reza On
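A hedged sketch of that remapping: build an index -> rank dictionary, broadcast it, and rewrite the sparse indices before training. The input RDD name and the assumption that the features are SparseVectors are mine:

```scala
import org.apache.spark.mllib.linalg.{SparseVector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

// Dictionary: hashed feature index -> dense rank. Sorting keeps the mapping monotone,
// so the remapped sparse indices stay in increasing order.
val featureRank: Map[Int, Int] = points                        // points: RDD[LabeledPoint]
  .flatMap(_.features.asInstanceOf[SparseVector].indices)
  .distinct()
  .collect().sorted.zipWithIndex.toMap
val rankBc = sc.broadcast(featureRank)

val remapped = points.map { p =>
  val sv = p.features.asInstanceOf[SparseVector]
  LabeledPoint(p.label,
    Vectors.sparse(rankBc.value.size, sv.indices.map(rankBc.value), sv.values))
}
```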

[mllib] GradientDescent requires huge memory for storing weight vector

2015-01-12 Thread Tianshuo Deng
Hi, Currently in GradientDescent.scala, weights is constructed as a dense vector: initialWeights = Vectors.dense(new Array[Double](numFeatures)) And numFeatures is determined in loadLibSVMFile as the max index of the features. But in the case of using a hash function to compute feature

Re: train many decision tress with a single spark job

2015-01-12 Thread Josh Buffum
Sean, Thanks for the response. Is there some subtle difference between one model partitioned by N users and N models, one per user? I think I'm missing something with your question. Looping through the RDD filtering one user at a time would certainly give me the response that I am hoping for

Re: Manually trigger RDD map function without action

2015-01-12 Thread Kevin Jung
Cody said, "If you don't care about the value that your map produced (because you're not already collecting or saving it), then is foreach more appropriate to what you're doing?" but I cannot see it in this thread. Anyway, I performed a small benchmark to test which function is the most efficient

Re: quickly counting the number of rows in a partition?

2015-01-12 Thread Sven Krasser
Yes, using mapPartitionsWithIndex, e.g. in PySpark: sc.parallelize(xrange(0,1000), 4).mapPartitionsWithIndex(lambda idx,iter: ((idx, len(list(iter))),)).collect() [(0, 250), (1, 250), (2, 250), (3, 250)] (This is not the most efficient way to get the length of an iterator, but you get the
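The same idea in Scala, as a sketch for reference:

```scala
val counts = sc.parallelize(0 until 1000, 4)
  .mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size)) }
  .collect()   // e.g. Array((0,250), (1,250), (2,250), (3,250))
```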

Re: train many decision tress with a single spark job

2015-01-12 Thread Josh Buffum
You are right... my code example doesn't work :) I actually do want a decision tree per user. So, for 1 million users, I want 1 million trees. We're training against time series data, so there are still quite a few data points per user. My previous message where I mentioned RDDs with no length

Re: OOM exception during row deserialization

2015-01-12 Thread Sven Krasser
Hey Pala, I also find it very hard to get to the bottom of memory issues such as this one based on what's in the logs (so if you come up with some findings, then please share here). In the interim, here are a few things you can try: - Provision more memory per executor. While in theory (and

Broadcast joins on RDD

2015-01-12 Thread Pala M Muthaia
Hi, How do I do a broadcast/map join on RDDs? I have a large RDD that I want to inner join with a small RDD. Instead of having the large RDD repartitioned and shuffled for the join, I would rather send a copy of the small RDD to each task, and then perform the join locally. How would I specify this in

Re: train many decision tress with a single spark job

2015-01-12 Thread Sean Owen
A model partitioned by users? I mean that if you have a million users surely you don't mean to build a million models. There would be little data per user right? Sounds like you have 0 sometimes. You would typically be generalizing across users not examining them in isolation. Models are built

Re: Trouble with large Yarn job

2015-01-12 Thread Sven Krasser
Anders, This could be related to this open ticket: https://issues.apache.org/jira/browse/SPARK-5077. A call to coalesce() also fixed that for us as a stopgap. Best, -Sven On Mon, Jan 12, 2015 at 10:18 AM, Anders Arpteg arp...@spotify.com wrote: Yes sure Sandy, I've checked the logs and it's

Running Spark application from command line

2015-01-12 Thread Arun Lists
I have a Spark application that was assembled using sbt 0.13.7, Scala 2.11, and Spark 1.2.0. In build.sbt, I use "provided" for the Spark dependencies. I am running on Mac OS X Yosemite. I can run the application fine within sbt. I run into problems when I try to run it from the command line. Here

Re: Problem with building spark-1.2.0

2015-01-12 Thread Rapelly Kartheek
Yes, this proxy problem is resolved. "how your build refers to https://github.com/ScrapCodes/sbt-pom-reader.git -- I don't see this repo in the project code base." I manually downloaded the sbt-pom-reader directory and moved it into .sbt/0.13/staging/*/

Re: How to create an empty RDD with a given type?

2015-01-12 Thread Xuelin Cao
Got it, thanks! On Tue, Jan 13, 2015 at 2:00 PM, Justin Yip yipjus...@gmail.com wrote: Xuelin, There is a function called emtpyRDD under spark context http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext which serves this purpose. Justin On Mon, Jan

Re: How to create an empty RDD with a given type?

2015-01-12 Thread Justin Yip
Xuelin, There is a function called emptyRDD under spark context http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext which serves this purpose. Justin On Mon, Jan 12, 2015 at 9:50 PM, Xuelin Cao xuelincao2...@gmail.com wrote: Hi, I'd like to create a
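A one-line usage sketch of that method:

```scala
val empty = sc.emptyRDD[String]   // an RDD[String] with no partitions and no elements
```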

Re: Is It Feasible for Spark 1.1 Broadcast to Fully Utilize the Ethernet Card Throughput?

2015-01-12 Thread lihu
What about your scenario? Do you need to use lots of broadcasts? If not, it would be better to focus more on other things. At this time, there is no better method than TorrentBroadcast. Although it distributes one-by-one, after one node gets the data it can act as a data source immediately.

Creating RDD from only few columns of a Parquet file

2015-01-12 Thread Ajay Srivastava
Hi, I am trying to read a Parquet file using: val parquetFile = sqlContext.parquetFile("people.parquet"). There is no way to specify that I am interested in reading only some columns from disk. For example, if the Parquet file has 10 columns, I want to read only 3 columns from disk. We have done
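One hedged way to express this in Spark SQL so that only the referenced columns should be read from the Parquet file (the table and column names are placeholders):

```scala
val parquetFile = sqlContext.parquetFile("people.parquet")
parquetFile.registerTempTable("people")

// Only the three selected columns need to be scanned, thanks to Parquet's columnar layout.
val threeCols = sqlContext.sql("SELECT name, age, city FROM people")
```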

configuring spark.yarn.driver.memoryOverhead on Spark 1.2.0

2015-01-12 Thread David McWhorter
Hi all, I'm trying to figure out how to set this option: spark.yarn.driver.memoryOverhead on Spark 1.2.0. I found this helpful overview http://apache-spark-user-list.1001560.n3.nabble.com/Stable-spark-streaming-app-td14105.html#a14476, which suggests to launch with

Re: configuring spark.yarn.driver.memoryOverhead on Spark 1.2.0

2015-01-12 Thread Ganelin, Ilya
There are two related options: To solve your problem directly try: val conf = new SparkConf().set("spark.yarn.driver.memoryOverhead", "1024"); val sc = new SparkContext(conf) And the second, which increases the overall memory available on the driver, as part of your spark-submit script add:

RE: Failed to save RDD as text file to local file system

2015-01-12 Thread NingjunWang
Prannoy, I tried r.saveAsTextFile("home/cloudera/tmp/out1"); it returns without error. But where is it saved to? The folder “/home/cloudera/tmp/out1” is not created. I also tried the following: cd /home/cloudera/tmp/; spark-shell; scala> val r = sc.parallelize(Array("a", "b", "c")) scala>

can I buffer flatMap input at each worker node?

2015-01-12 Thread maherrt
Dear All, what I want to do is: as the data is partitioned across many worker nodes, I want to be able to process each partition of data as a whole and then produce my output, using flatMap for example. So can I load all of a partition's input records on one worker node and emit any output

Re: Trouble with large Yarn job

2015-01-12 Thread Anders Arpteg
Yes sure Sandy, I've checked the logs and it's not a OOM issue. I've actually been able to solve the problem finally, and it seems to be an issue with too many partitions. The repartitioning I tried initially did so after the union, and then it's too late. By repartitioning as early as possible,

Re: can I buffer flatMap input at each worker node?

2015-01-12 Thread Sven Krasser
Not sure I understand correctly, but it sounds like you're looking for mapPartitions(). -Sven On Mon, Jan 12, 2015 at 10:17 AM, maherrt mahe...@hotmail.com wrote: Dear All what i want to do is : as the data is partitioned on many worker nodes I want to be able to process this partition of

Pattern Matching / Equals on Case Classes in Spark Not Working

2015-01-12 Thread Rosner, Frank (Allianz SE)
Dear Spark Users, I googled the web for several hours now but I don't find a solution for my problem. So maybe someone from this list can help. I have an RDD of case classes, generated from CSV files with Spark. When I used the distinct operator, there were still duplicates. So I investigated

Re: Issue writing to Cassandra from Spark

2015-01-12 Thread Ankur Srivastava
Hi Akhil, Thank you for the pointers. Below is how we are saving data to Cassandra. javaFunctions(rddToSave).writerBuilder(datapipelineKeyspace, datapipelineOutputTable, mapToRow(Sample.class)) The data we are saving at this stage is ~200 million rows. How do we control application threads

Re: Pattern Matching / Equals on Case Classes in Spark Not Working

2015-01-12 Thread Matei Zaharia
Is this in the Spark shell? Case classes don't work correctly in the Spark shell unfortunately (though they do work in the Scala shell) because we change the way lines of code compile to allow shipping functions across the network. The best way to get case classes in there is to compile them