Add to the spark users list

2015-01-06 Thread Bilna Govind
Hi, Organization name: Amrita Center for Cyber Security Systems and Networks URL: https://www.amrita.edu/center/cyber-security We use Spark for BigData analytics and ML/Data Mining. Spark Streaming in IoT Platform -- Regards, Bilna P

Re: Guava 11 dependency issue in Spark 1.2.0

2015-01-06 Thread Sean Owen
-dev Guava was not downgraded to 11. That PR was not merged. It was part of a discussion about, indeed, what to do about potential Guava version conflicts. Spark uses Guava, but so does Hadoop, and so do user programs. Spark uses 14.0.1 in fact:

Re: Guava 11 dependency issue in Spark 1.2.0

2015-01-06 Thread Niranda Perera
Hi Sean, My mistake, Guava 11 dependency came from the hadoop-commons indeed. I'm running the following simple app in spark 1.2.0 standalone local cluster (2 workers) with Hadoop 1.2.1 public class AvroSparkTest { public static void main(String[] args) throws Exception { SparkConf

Re: Guava 11 dependency issue in Spark 1.2.0

2015-01-06 Thread Sean Owen
Oh, are you actually bundling Hadoop in your app? That may be the problem. If you're using stand-alone mode, why include Hadoop? In any event, Spark and Hadoop are intended to be 'provided' dependencies in the app you send to spark-submit. On Tue, Jan 6, 2015 at 10:15 AM, Niranda Perera
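
A minimal build.sbt sketch of the 'provided' advice above (the artifact names and versions are assumptions; adjust to your own build):

    // build.sbt (sketch): Spark and Hadoop are supplied by the cluster at runtime,
    // so they should not be bundled into the jar handed to spark-submit.
    libraryDependencies ++= Seq(
      "org.apache.spark"  %% "spark-core"    % "1.2.0" % "provided",
      "org.apache.hadoop" %  "hadoop-client" % "1.2.1" % "provided"
    )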

Guava 11 dependency issue in Spark 1.2.0

2015-01-06 Thread Niranda Perera
Hi, I have been running a simple Spark app on a local spark cluster and I came across this error. Exception in thread main java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode; at org.apache.spark.util.collection.OpenHashSet.org

Re: Shuffle Problems in 1.2.0

2015-01-06 Thread Davies Liu
I ran your scripts on a 5-node (2 CPUs, 8G mem) cluster and cannot reproduce your failure. Should I test it with a big-memory node? On Mon, Jan 5, 2015 at 4:00 PM, Sven Krasser kras...@gmail.com wrote: Thanks for the input! I've managed to come up with a repro of the error with test data only

Re: Timeout Exception in standalone cluster

2015-01-06 Thread Akhil Das
That's an issue with your firewall (more likely hostnames issue in /etc/hosts), You may find the following posts helpful - http://stackoverflow.com/questions/27039954/intermittent-timeout-exception-using-spark - http://koobehub.wordpress.com/2014/09/29/spark-the-standalone-cluster-deployment/

Re: Saving partial (top 10) DStream windows to hdfs

2015-01-06 Thread Akhil Das
You can try something like: *val top10 = your_stream.mapPartitions(rdd => rdd.take(10))* Thanks Best Regards On Mon, Jan 5, 2015 at 11:08 PM, Laeeq Ahmed laeeqsp...@yahoo.com.invalid wrote: Hi, I am counting values in each window and find the top values and want to save only the top 10
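
A hedged Scala sketch of both readings of that suggestion, assuming a DStream[Double] (the stream type and output path are illustrative):

    import org.apache.spark.streaming.dstream.DStream

    // The mapPartitions form above keeps the first 10 elements of each partition:
    def firstTenPerPartition(stream: DStream[Double]): DStream[Double] =
      stream.mapPartitions(iter => iter.take(10))

    // For a true top 10 per batch/window, transform with RDD.top instead:
    def topTenPerWindow(stream: DStream[Double]): DStream[Double] =
      stream.transform(rdd => rdd.sparkContext.parallelize(rdd.top(10)))

    // topTenPerWindow(stream).saveAsTextFiles("hdfs:///output/top10")  // hypothetical prefix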

Re: Spark error in execution

2015-01-06 Thread Daniel Darabos
Hello! I just had a very similar stack trace. It was caused by an Akka version mismatch. (From trying to use Play 2.3 with Spark 1.1 by accident instead of 1.2.) On Mon, Nov 24, 2014 at 7:15 PM, Blackeye black...@iit.demokritos.gr wrote: I created an application in spark. When I run it with

Re: NoSuchMethodError: com.typesafe.config.Config.getDuration with akka-http/akka-stream

2015-01-06 Thread Pankaj Narang
Good luck. Let me know If I can assist you further Regards -Pankaj Linkedin https://www.linkedin.com/profile/view?id=171566646 Skype pankaj.narang -- View this message in context:

Re: Set EXTRA_JAR environment variable for spark-jobserver

2015-01-06 Thread Pankaj Narang
I suggest to create uber jar instead. check my thread for the same http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-com-typesafe-config-Config-getDuration-with-akka-http-akka-stream-td20926.html Regards -Pankaj Linkedin https://www.linkedin.com/profile/view?id=171566646

Re: Set EXTRA_JAR environment variable for spark-jobserver

2015-01-06 Thread Akhil Das
Or you can use: sc.addJar("/path/to/your/datastax.jar") Thanks Best Regards On Tue, Jan 6, 2015 at 5:53 PM, bchazalet bchaza...@companywatch.net wrote: I don't know much about spark-jobserver, but you can set jars programmatically using the method setJars on SparkConf. Looking at your code it

Re: Set EXTRA_JAR environment variable for spark-jobserver

2015-01-06 Thread bchazalet
I don't know much about spark-jobserver, but you can set jars programmatically using the method setJars on SparkConf. Looking at your code it seems that you're importing classes from com.datastax.spark.connector._ to load data from Cassandra, so you may need to add that datastax jar to your
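
A minimal sketch combining both suggestions from this thread, with a hypothetical jar path:

    import org.apache.spark.{SparkConf, SparkContext}

    // Either list the jars up front on the SparkConf...
    val conf = new SparkConf()
      .setAppName("jobserver test demo")
      .setJars(Seq("/path/to/spark-cassandra-connector-assembly.jar"))   // hypothetical path
    val sc = new SparkContext(conf)

    // ...or add them after the context has been created.
    sc.addJar("/path/to/spark-cassandra-connector-assembly.jar")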

streamSQL - is it available or is it in POC ?

2015-01-06 Thread tfrisk
Hi, Just wondering whether this is released yet and if so on which version of Spark ? Many Thanks, Thomas -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/streamSQL-is-it-available-or-is-it-in-POC-tp20993.html Sent from the Apache Spark User List

Re: NoSuchMethodError: com.typesafe.config.Config.getDuration with akka-http/akka-stream

2015-01-06 Thread Christophe Billiard
Thanks Pankaj for the assembly plugin tip. Yes there is a version mismatch of akka actor between Spark 1.1.1 and akka-http/akka-stream (2.2.3 versus 2.3.x). After some digging, I see 4 options for this problem (in case others encounter it): 1) Upgrade to Spark 1.2.0, the same code will work (not

Re: Set EXTRA_JAR environment variable for spark-jobserver

2015-01-06 Thread Sasi
Boris, Thank you for your suggestion. I used the following code and am still facing the same issue - val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1").setAppName("jobserver test demo")

Trouble with large Yarn job

2015-01-06 Thread Anders Arpteg
Hey, I have a job that keeps failing if too much data is processed, and I can't see how to get it working. I've tried repartitioning with more partitions and increasing the amount of memory for the executors (now about 12G and 400 executors). Here is a snippet of the first part of the code, which

Re: Finding most occurrences in a JSON Nested Array

2015-01-06 Thread adstan
Many thanks Pankaj, I've got it working. For completeness, here's the whole segment (including the printout at diff stages): -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Finding-most-occurrences-in-a-JSON-Nested-Array-tp20971p20996.html Sent from the

I think I am almost lost in the internals of Spark

2015-01-06 Thread Todd
I am a bit new to Spark, except that I have tried simple things like word count and the examples given in the Spark SQL programming guide. Now I am investigating the internals of Spark, but I think I am almost lost, because I could not grasp a whole picture of what Spark does when it executes the

How to limit the number of concurrent tasks per node?

2015-01-06 Thread Pengcheng YIN
Hi Pro, One map() operation in my Spark app takes an RDD[A] as input and maps each element in RDD[A] to another object of type B using a custom mapping function func(x:A):B. I received lots of OutOfMemory errors, and after some debugging I found this is because func() requires significant

Saving data to Hbase hung in Spark streaming application with Spark 1.2.0

2015-01-06 Thread Max Xu
Hi all, I have a Spark streaming application that ingests data from a Kafka topic and persists received data to Hbase. It works fine with Spark 1.1.1 in YARN cluster mode. Basically, I use the following code to persist each partition of each RDD to Hbase: @Override void
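
The quoted code is cut off, but a common shape for this kind of per-partition HBase write looks roughly like the sketch below (table and column names are made up; assumes the HBase 0.98 client API and an RDD of (rowKey, value) pairs):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HTable, Put}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.rdd.RDD

    def saveToHBase(rdd: RDD[(String, String)]): Unit =
      rdd.foreachPartition { records =>
        // One table handle per partition, not per record.
        val table = new HTable(HBaseConfiguration.create(), "events")
        records.foreach { case (rowKey, value) =>
          val put = new Put(Bytes.toBytes(rowKey))
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
          table.put(put)
        }
        table.close()
      }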

Problem getting Spark running on a Yarn cluster

2015-01-06 Thread Sharon Rapoport
Hello, We have hadoop 2.6.0 and Yarn set up on ec2. Trying to get spark 1.1.1 running on the Yarn cluster. I have of course googled around and found that this problem is solved for most after removing the line including 127.0.1.1 from /etc/hosts. This hasn’t seemed to solve this for me.

Pyspark Interactive shell

2015-01-06 Thread Naveen Kumar Pokala
Hi, Anybody tried to connect to spark cluster( on UNIX machines) from windows interactive shell ? -Naveen.

Re: Finding most occurrences in a JSON Nested Array

2015-01-06 Thread Pankaj Narang
That's great. I did not have access to the developer machine, so I sent you the pseudo code only. Happy to see it's working. If you need any more help related to Spark, let me know anytime. -- View this message in context:

Re: Set EXTRA_JAR environment variable for spark-jobserver

2015-01-06 Thread bchazalet
It does not look like you're supposed to fiddle with the SparkConf and even SparkContext in a 'job' (again, I don't know much about jobserver), as you're given a SparkContext as parameter in the build method. I guess jobserver initialises the SparkConf and SparkContext itself when it first

Location of logs in local mode

2015-01-06 Thread Brett Meyer
I'm submitting a script using spark-submit in local mode for testing, and I'm having trouble figuring out where the logs are stored. The documentation indicates that they should be in the work folder in the directory in which Spark lives on my system, but I see no such folder there. I've set the

Re: Set EXTRA_JAR environment variable for spark-jobserver

2015-01-06 Thread Todd Nist
*@Sasi* You should be able to create a job something like this: package io.radtech.spark.jobserver import java.util.UUID import org.apache.spark.{ SparkConf, SparkContext } import org.apache.spark.rdd.RDD import org.joda.time.DateTime import com.datastax.spark.connector.types.TypeConverter

Is there a way to read a parquet database without generating an RDD

2015-01-06 Thread Steve Lewis
I have an application where a function needs access to the results of a select from a parquet database. Creating a JavaSQLContext and from it a JavaSchemaRDD as shown below works but the parallelism is not needed - a simple JDBC call would work - Are there alternative non-parallel ways to

Re: RDD Moving Average

2015-01-06 Thread Sean Owen
So you want windows covering the same length of time, some of which will be fuller than others? You could, for example, simply bucket the data by minute to get this kind of effect. If you have an RDD[Ticker], where Ticker has a timestamp in ms, you could: tickerRDD.groupBy(ticker => (ticker.timestamp /
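
The quoted code is truncated; a hedged completion of the bucketing idea, assuming a simple Ticker(timestamp, price) shape and a mean price per bucket:

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    case class Ticker(timestamp: Long, price: Double)   // assumed shape; timestamp in ms

    // One bucket per minute, then (for example) the mean price per bucket.
    def perMinuteAverage(tickerRDD: RDD[Ticker]): RDD[(Long, Double)] =
      tickerRDD
        .groupBy(ticker => ticker.timestamp / (60 * 1000))
        .mapValues(ts => ts.map(_.price).sum / ts.size)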

Re: Multiple Spark Streaming receiver model

2015-01-06 Thread Silvio Fiorito
Hi Manjul, Each StreamingContext will have its own batch size. If that doesn’t work for the different sources you have then you would have to create different streaming apps. You can only create a new StreamingContext in the same Spark app, once you’ve stopped the previous one. Spark

Re: Shuffle Problems in 1.2.0

2015-01-06 Thread Davies Liu
I still cannot reproduce it with 2 nodes (4 CPUs). Your repro.py could be faster (10 min) than before (22 min): inpdata.map(lambda (pc, x): (x, pc=='p' and 2 or 1)).reduceByKey(lambda x, y: x|y).filter(lambda (x, pc): pc==3).collect() (also, no cache needed anymore) Davies On Tue, Jan 6,

Multiple Spark Streaming receiver model

2015-01-06 Thread manjuldixit
Hi, We have a requirement of receiving live input messages from RabbitMQ and process them into micro batches. For this we have selected SparkStreaming and we have written a connector for RabbitMQ receiver and Spark streaming, it is working fine. Now the main requirement is to receive different

HDFS_DELEGATION_TOKEN errors after switching Spark Contexts

2015-01-06 Thread Ganelin, Ilya
Hi all. In order to get Spark to properly release memory during batch processing as a workaround to issue https://issues.apache.org/jira/browse/SPARK-4927 I tear down and re-initialize the spark context with : context.stop() and context = new SparkContext() The problem I run into is that

Re: MLLIB and Openblas library in non-default dir

2015-01-06 Thread Tomas Hudik
Thanks Xiangrui, I'll try it. BTW: spark-submit is a standalone program (bin/spark-submit). Therefore, the JVM has to be executed after the spark-submit script. Am I correct? On Mon, Jan 5, 2015 at 10:35 PM, Xiangrui Meng men...@gmail.com wrote: It might be hard to do that with spark-submit, because

Re: Reading from a centralized stored

2015-01-06 Thread Cody Koeninger
No, most rdds partition input data appropriately. On Tue, Jan 6, 2015 at 1:41 PM, Franc Carter franc.car...@rozettatech.com wrote: One more question, to be clarify. Will every node pull in all the data ? thanks On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger c...@koeninger.org wrote: If

confidence/probability for prediction in MLlib

2015-01-06 Thread Jianguo Li
Hi, A while ago, somebody asked about getting a confidence value of a prediction with MLlib's implementation of Naive Bayes's classification. I was wondering if there is any plan in the near future for the predict function to return both a label and a confidence/probability? Or could the private

Re: Reading from a centralized stored

2015-01-06 Thread Franc Carter
Ah, so it's RDD specific - that would make sense. For those systems where it is possible to extract sensible subsets, the RDDs do so. My use case, which is probably biasing my thinking, is DynamoDB, which I don't think can efficiently extract records from M-to-N. cheers On Wed, Jan 7, 2015 at 6:59

Re: RDD Moving Average

2015-01-06 Thread Asim Jalis
Thanks. Another question. I have event data with timestamps. I want to create a sliding window using timestamps. Some windows will have a lot of events in them, others won't. Is there a way to get an RDD made of this kind of variable-length window? On Tue, Jan 6, 2015 at 1:03 PM, Sean Owen

Re: RDD Moving Average

2015-01-06 Thread Asim Jalis
Except I want it to be a sliding window. So the same record could be in multiple buckets. On Tue, Jan 6, 2015 at 3:43 PM, Sean Owen so...@cloudera.com wrote: So you want windows covering the same length of time, some of which will be fuller than others? You could, for example, simply bucket

Re: RDD Moving Average

2015-01-06 Thread Asim Jalis
One problem with this is that we are creating a lot of iterables containing a lot of repeated data. Is there a way to do this so that we can calculate a moving average incrementally? On Tue, Jan 6, 2015 at 4:44 PM, Sean Owen so...@cloudera.com wrote: Yes, if you break it down to...

Re: Using ec2 launch script with locally built version of spark?

2015-01-06 Thread gen tang
Hi, Since the ec2 launch script provided by Spark uses https://github.com/mesos/spark-ec2 to download and configure all the tools in the cluster (spark, hadoop etc.), you can create your own git repository to achieve your goal. More precisely: 1. Upload your own version of spark in s3 at address

Re: RDD Moving Average

2015-01-06 Thread Sean Owen
Interesting, I am not sure the order in which fold() encounters elements is guaranteed, although from reading the code, I imagine in practice it is first-to-last by partition and then folded first-to-last from those results on the driver. I don't know this would lead to a solution though as the

Re: RDD Moving Average

2015-01-06 Thread Asim Jalis
One approach I was considering was to use mapPartitions. It is straightforward to compute the moving average over a partition, except for near the end point. Does anyone see how to fix that? On Tue, Jan 6, 2015 at 7:20 PM, Sean Owen so...@cloudera.com wrote: Interesting, I am not sure the order

R: RDD Moving Average

2015-01-06 Thread Paolo Platter
In my opinion you should use the fold pattern, obviously after a sortBy transformation. Paolo Sent from my Windows Phone From: Asim Jalis mailto:asimja...@gmail.com Sent: 06/01/2015 23:11 To: Sean Owen mailto:so...@cloudera.com Cc:

How to replace user.id to user.names in a file

2015-01-06 Thread riginos
I work on a user-to-user recommender for a website using mllib.recommendation. I have created a file (recommends.txt) which contains the top 5 recommendations for each user id. The file's form (recommends.txt) is something like this (user::rec1:rec2:rec3:rec4:rec5): /**file's snapshot**/

Re: RDD Moving Average

2015-01-06 Thread Asim Jalis
I guess I can use a similar groupBy approach. Map each event to all the windows that it can belong to. Then do a groupBy, etc. I was wondering if there was a more elegant approach. On Tue, Jan 6, 2015 at 3:45 PM, Asim Jalis asimja...@gmail.com wrote: Except I want it to be a sliding window. So
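
A hedged sketch of that groupBy idea, assuming each event has a millisecond timestamp and a price (the Event shape, window length and slide interval are illustrative):

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    case class Event(timestamp: Long, price: Double)   // assumed shape

    // Windows are windowMs long and start at multiples of slideMs; each event is
    // assigned to every window that contains it, keyed by the window's start time.
    def averagePerWindow(events: RDD[Event], windowMs: Long, slideMs: Long): RDD[(Long, Double)] =
      events.flatMap { e =>
        val lastIdx  = e.timestamp / slideMs
        val firstIdx = math.max(0L, (e.timestamp - windowMs) / slideMs + 1)
        (firstIdx to lastIdx).map(idx => (idx * slideMs, e.price))
      }.groupByKey()
       .mapValues(ps => ps.sum / ps.size)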

Snappy error when driver is running in JBoss

2015-01-06 Thread Charles
I get this exception(java.lang.UnsatisfiedLinkError) when the driver is running inside JBoss. We are running with DataStax 4.6 version, which is using spark 1.1.0. The driver runs inside a wildfly container. The snappy-java version is 1.0.5. 2015-01-06 20:25:03,771 ERROR

Re: Spark 1.1.0 and HBase: Snappy UnsatisfiedLinkError

2015-01-06 Thread Charles
Hi, I am getting this same error. Did you figure out how to solve the problem? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-1-0-and-HBase-Snappy-UnsatisfiedLinkError-tp19827p21005.html Sent from the Apache Spark User List mailing list

Using ec2 launch script with locally built version of spark?

2015-01-06 Thread Ganon Pierce
Is there a way to use the ec2 launch script with a locally built version of spark? I launch and destroy clusters pretty frequently and would like to not have to wait each time for the master instance to compile the source as happens when I set the -v tag with the latest git commit. To be clear,

Re: Driver hangs on running mllib word2vec

2015-01-06 Thread Ganon Pierce
Two billion words is a very large vocabulary… You can try solving this issue by setting the number of times words must occur in order to be included in the vocabulary using setMinCount; this will prevent common misspellings, websites, and other things from being included and may improve

Re: Driver hangs on running mllib word2vec

2015-01-06 Thread Ganon Pierce
Oops, just kidding, this method is not in the current release. However, it is included in the latest commit on git if you want to do a build. On Jan 6, 2015, at 2:56 PM, Ganon Pierce ganon.pie...@me.com wrote: Two billion words is a very large vocabulary… You can try solving this issue by

Re: Data Locality

2015-01-06 Thread Andrew Ash
You can also read about locality here in the docs: http://spark.apache.org/docs/latest/tuning.html#data-locality On Tue, Jan 6, 2015 at 8:37 AM, Cody Koeninger c...@koeninger.org wrote: No, not all rdds have location information, and in any case tasks may be scheduled on non-local nodes if

Current Build Gives HTTP ERROR

2015-01-06 Thread Ganon Pierce
I’m attempting to build from the latest commit on git and receive the following error upon attempting to access the application web ui: HTTP ERROR: 500 Problem accessing /jobs/. Reason: Server Error Powered by Jetty:// My driver also prints this error:

Re: Launching Spark app in client mode for standalone cluster

2015-01-06 Thread Boromir Widas
Thanks for the pointers. The issue was due to route caching by Spray, which would always return the same value. Other than that the program is working fine. On Mon, Jan 5, 2015 at 12:44 AM, Simon Chan simonc...@gmail.com wrote: Boromir, You may like to take a look at how we make Spray and

Re: Current Build Gives HTTP ERROR

2015-01-06 Thread Sean Owen
FWIW I do not see any such error, after a mvn -DskipTests clean package and ./bin/spark-shell from master. Maybe double-check you have done a full clean build. On Tue, Jan 6, 2015 at 9:09 PM, Ganon Pierce ganon.pie...@me.com wrote: I’m attempting to build from the latest commit on git and

[MLLib] storageLevel in ALS

2015-01-06 Thread Fernando O.
Hi, I was doing some tests with ALS and I noticed that if I persist the inner RDDs from a MatrixFactorizationModel, the RDD is not replicated; it seems like the storage level is hardcoded to MEMORY_AND_DISK. Do you think it makes sense to make that configurable? [image: Inline image 1]

Re: Snappy error when driver is running in JBoss

2015-01-06 Thread Ted Yu
Might be due to conflict between multiple snappy jars. Can you check the classpath to see if there are more than one snappy jar ? Cheers On Tue, Jan 6, 2015 at 2:26 PM, Charles charles...@cenx.com wrote: I get this exception(java.lang.UnsatisfiedLinkError) when the driver is running inside

Re: Parquet schema changes

2015-01-06 Thread Adam Gilmore
Anyone got any further thoughts on this? I saw the _metadata file seems to store the schema of every single part (i.e. file) in the parquet directory, so in theory it should be possible. Effectively, our use case is that we have a stack of JSON that we receive and we want to encode to Parquet

Re: Re: I think I am almost lost in the internals of Spark

2015-01-06 Thread Boromir Widas
I do not understand Chinese but the diagrams on that page are very helpful. On Tue, Jan 6, 2015 at 9:46 PM, eric wong win19...@gmail.com wrote: A good beginning if you are Chinese. https://github.com/JerryLead/SparkInternals/tree/master/markdown 2015-01-07 10:13 GMT+08:00 bit1...@163.com

Re: Cannot see RDDs in Spark UI

2015-01-06 Thread Andrew Ash
Hi Manoj, I've noticed that the storage tab only shows RDDs that have been cached. Did you call .cache() or .persist() on any of the RDDs? Andrew On Tue, Jan 6, 2015 at 6:48 PM, Manoj Samel manojsamelt...@gmail.com wrote: Hi, I create a bunch of RDDs, including schema RDDs. When I run the
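
A minimal sketch of that point (assumes a spark-shell style `sc` and a hypothetical input path): an RDD only appears under the Storage tab once it has been marked for caching and then materialized.

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("hdfs:///some/input").setName("my input")  // name shown in the Storage tab
    lines.persist(StorageLevel.MEMORY_ONLY)   // or simply lines.cache()
    lines.count()                             // force evaluation so the blocks show up under Storage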

Re: I think I am almost lost in the internals of Spark

2015-01-06 Thread Tobias Pfeiffer
Hi, On Tue, Jan 6, 2015 at 11:24 PM, Todd bit1...@163.com wrote: I am a bit new to Spark, except that I tried simple things like word count, and the examples given in the spark sql programming guide. Now, I am investigating the internals of Spark, but I think I am almost lost, because I

Re: Re: I think I am almost lost in the internals of Spark

2015-01-06 Thread eric wong
A good beginning if you are Chinese. https://github.com/JerryLead/SparkInternals/tree/master/markdown 2015-01-07 10:13 GMT+08:00 bit1...@163.com bit1...@163.com: Thank you, Tobias. I will look into the Spark paper. But it looks like the paper has been moved,

Re: Shuffle Problems in 1.2.0

2015-01-06 Thread Sven Krasser
Hey Davies, Here are some more details on a configuration that causes this error for me. Launch an AWS Spark EMR cluster as follows: *aws emr create-cluster --region us-west-1 --no-auto-terminate \ --ec2-attributes KeyName=your-key-here,SubnetId=your-subnet-here \ --bootstrap-actions

Re: Parquet schema changes

2015-01-06 Thread Michael Armbrust
I want to support this but we don't yet. Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-3851 On Tue, Jan 6, 2015 at 5:23 PM, Adam Gilmore dragoncu...@gmail.com wrote: Anyone got any further thoughts on this? I saw the _metadata file seems to store the schema of every single

Re: Re: I think I am almost lost in the internals of Spark

2015-01-06 Thread bit1...@163.com
Thanks Eric. Yes..I am Chinese, :-). I will read through the articles, thank you! bit1...@163.com From: eric wong Date: 2015-01-07 10:46 To: bit1...@163.com CC: user Subject: Re: Re: I think I am almost lost in the internals of Spark A good beginning if you are chinese.

SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-06 Thread Nathan McCarthy
Hi, I’m trying to use a combination of SparkSQL and ‘normal' Spark/Scala via rdd.mapPartitions(…). Using the latest release 1.2.0. Simple example; load up some sample data from parquet on HDFS (about 380m rows, 10 columns) on a 7 node cluster. val t =

Re: Re: I think I am almost lost in the internals of Spark

2015-01-06 Thread Krishna Sankar
Interestingly Google Chrome translates the materials. Cheers k/ On Tue, Jan 6, 2015 at 7:26 PM, Boromir Widas vcsub...@gmail.com wrote: I do not understand Chinese but the diagrams on that page are very helpful. On Tue, Jan 6, 2015 at 9:46 PM, eric wong win19...@gmail.com wrote: A good

Re: How to replace user.id to user.names in a file

2015-01-06 Thread Tobias Pfeiffer
Hi, it looks to me as if you need the whole user database on every node, so maybe put the id-name information as a Map[Id, String] in a broadcast variable and then do something like recommendations.map(line => { line.map(uid => usernames(uid)) }) or so? Tobias
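
A minimal sketch of the broadcast approach, with a made-up id-to-name map and an assumed RDD of recommendation rows:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Hypothetical lookup table; in practice built from the users file.
    def replaceIdsWithNames(sc: SparkContext,
                            recommendations: RDD[Seq[Int]],
                            userNames: Map[Int, String]): RDD[Seq[String]] = {
      val namesBc = sc.broadcast(userNames)
      recommendations.map(ids => ids.map(id => namesBc.value.getOrElse(id, id.toString)))
    }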

Re: How to replace user.id to user.names in a file

2015-01-06 Thread Tobias Pfeiffer
Hi, On Wed, Jan 7, 2015 at 10:47 AM, Riginos Samaras samarasrigi...@gmail.com wrote: Yes something like this. Can you please give me an example to create a Map? That depends heavily on the shape of your input file. What about something like: (for (line <- Source.fromFile(filename).getLines())

Re: How to replace user.id to user.names in a file

2015-01-06 Thread Tobias Pfeiffer
Hi, On Wed, Jan 7, 2015 at 11:13 AM, Riginos Samaras samarasrigi...@gmail.com wrote: exactly, that's what I'm looking for; my code is like this: //code val users_map = users_file.map{ s => val parts = s.split(","); (parts(0).toInt, parts(1)) }.distinct //code but I get the error:

Re: different akka versions and spark

2015-01-06 Thread Koert Kuipers
If the classes are in the original location then I think it's safe to say that this makes it impossible for us to build one app that can run against Spark 1.0.x, 1.1.x and 1.2.x. That's no big deal, but it does beg the question of what compatibility can reasonably be expected for Spark 1.x

Re: Spark Driver behind NAT

2015-01-06 Thread Aaron
From what I can tell, this isn't a firewall issue per se... it's how the Remoting Service binds to an IP given command-line parameters. So, if I have a VM (or OpenStack or EC2 instance) running on a private network, let's say, where the IP address is 192.168.X.Y... I can't tell the Workers to reach me on

Re: Spark Driver behind NAT

2015-01-06 Thread Aaron
Found the issue in JIRA: https://issues.apache.org/jira/browse/SPARK-4389?jql=project%20%3D%20SPARK%20AND%20text%20~%20NAT On Tue, Jan 6, 2015 at 10:45 AM, Aaron aarongm...@gmail.com wrote: From what I can tell, this isn't a firewall issue per se..it's how the Remoting Service binds to an IP

Re: MLLIB and Openblas library in non-default dir

2015-01-06 Thread Xiangrui Meng
spark-submit may not share the same JVM with Spark master and executors. On Tue, Jan 6, 2015 at 11:40 AM, Tomas Hudik xhu...@gmail.com wrote: thanks Xiangrui I'll try it. BTW: spark-submit is a standalone program (bin/spark-submit). Therefore, JVM has to be executed after spark-submit

Re: [MLLib] storageLevel in ALS

2015-01-06 Thread Xiangrui Meng
Which Spark version are you using? We made this configurable in 1.1: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L202 -Xiangrui On Tue, Jan 6, 2015 at 12:57 PM, Fernando O. fot...@gmail.com wrote: Hi, I was doing a tests

Re: Using graphx to calculate average distance of a big graph

2015-01-06 Thread James
We are going to estimate the average distance using [HyperANF](http://arxiv.org/abs/1011.5599) on a 100-billion-edge graph. 2015-01-07 2:18 GMT+08:00 Ankur Dave ankurd...@gmail.com: [-dev] What size of graph are you hoping to run this on? For small graphs where materializing the all-pairs

Re: Spark SQL implementation error

2015-01-06 Thread Pankaj Narang
As per our telephone call, here is how we can fetch the count: val tweetsCount = sql("SELECT COUNT(*) FROM tweets") println(f"\n\n\nThere are ${tweetsCount.collect.head.getLong(0)} Tweets on this Dataset\n\n") -- View this message in context:

Re: Why Parquet Predicate Pushdown doesn't work?

2015-01-06 Thread Daniel Haviv
Quoting Michael: Predicate push down into the input format is turned off by default because there is a bug in the current parquet library that throws null pointer exceptions when there are full row groups that are null. https://issues.apache.org/jira/browse/SPARK-4258 You can turn it on if you want:
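
The quoted reply is truncated at that point; a sketch of how the flag can be set (assuming an existing SparkContext `sc`):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
    // equivalently: sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")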

Re: Problem getting Spark running on a Yarn cluster

2015-01-06 Thread Akhil Das
Just follow this documentation http://spark.apache.org/docs/1.1.1/running-on-yarn.html Ensure that *HADOOP_CONF_DIR* or *YARN_CONF_DIR* points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to write to the dfs and connect to

Re: confidence/probability for prediction in MLlib

2015-01-06 Thread Xiangrui Meng
This is addressed in https://issues.apache.org/jira/browse/SPARK-4789. In the new pipeline API, we can simply output two columns, one for the best predicted class, and the other for probabilities or confidence scores for each class. -Xiangrui On Tue, Jan 6, 2015 at 11:43 AM, Jianguo Li

Why Parquet Predicate Pushdown doesn't work?

2015-01-06 Thread Xuelin Cao
Hi, I'm testing the parquet file format, and predicate pushdown is a very useful feature for us. However, it looks like predicate pushdown doesn't work after I set sqlContext.sql("SET spark.sql.parquet.filterPushdown=true"). Here is my sql:

Re: How to merge a RDD of RDDs into one uber RDD

2015-01-06 Thread k.tham
an RDD cannot contain elements of type RDD. (i.e. you can't nest RDDs within RDDs, in fact, I don't think it makes any sense) I suggest rather than having an RDD of file names, collect those file name strings back on to the driver as a Scala array of file names, and then from there, make an array
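
A hedged sketch of that suggestion, with made-up file paths and an assumed spark-shell style `sc`:

    // File names gathered on the driver as a plain Scala collection.
    val fileNames: Seq[String] = Seq("hdfs:///data/part-a.txt", "hdfs:///data/part-b.txt")

    val perFile = fileNames.map(path => sc.textFile(path))   // one RDD per file
    val uber    = sc.union(perFile)                          // a single RDD over all of them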

Re: TF-IDF from spark-1.1.0 not working on cluster mode

2015-01-06 Thread Xiangrui Meng
Could you attach the executor log? That may help identify the root cause. -Xiangrui On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch learnings.chitt...@gmail.com wrote: Hi All, Word2Vec and TF-IDF algorithms in spark mllib-1.1.0 are working only in local mode and not on distributed mode. Null

Re: Parquet predicate pushdown

2015-01-06 Thread Adam Gilmore
Thanks for that. Strangely enough I was actually using 1.1.1 where it did seem to be enabled by default. Since upgrading to 1.2.0 and setting that flag, I do get the expected result! Looks good! On Tue, Jan 6, 2015 at 12:17 PM, Michael Armbrust mich...@databricks.com wrote: Predicate push

Cannot see RDDs in Spark UI

2015-01-06 Thread Manoj Samel
Hi, I create a bunch of RDDs, including schema RDDs. When I run the program and go to UI on xxx:4040, the storage tab does not shows any RDDs. Spark version is 1.1.1 (Hadoop 2.3) Any thoughts? Thanks,

RE: Saving data to Hbase hung in Spark streaming application with Spark 1.2.0

2015-01-06 Thread Max Xu
Thanks Ted. You are right, hbase-site.xml is in the classpath. But previously I have it in the classpath too and the app works fine. I believe I found the problem. I built Spark 1.2.0 myself and forgot to change the dependency hbase version to 0.98.8-hadoop2, which is the version I use. When I

RE: Saving data to Hbase hung in Spark streaming application with Spark 1.2.0

2015-01-06 Thread Max Xu
Awesome. Thanks again Ted. I remember there is a block in the pom.xml under the examples folder that defaults the hbase version to hadoop1. I figured this out last time when I built Spark 1.1.1 but forgot this time. <profile> <id>hbase-hadoop1</id> <activation> <property>

RDD Moving Average

2015-01-06 Thread Asim Jalis
Is there an easy way to do a moving average across a single RDD (in a non-streaming app). Here is the use case. I have an RDD made up of stock prices. I want to calculate a moving average using a window size of N. Thanks. Asim

RE: Saving data to Hbase hung in Spark streaming application with Spark 1.2.0

2015-01-06 Thread Max Xu
Issue resolved after updating the Hbase version to 0.98.8-hadoop2. Thanks Ted for all the help! For future reference: This problem has nothing to do with Spark 1.2.0 but simply because I built Spark 1.2.0 with the wrong Hbase version. From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Tuesday,

Re: Data Locality

2015-01-06 Thread Cody Koeninger
No, not all rdds have location information, and in any case tasks may be scheduled on non-local nodes if there is idle capacity. see spark.locality.wait http://spark.apache.org/docs/latest/configuration.html On Tue, Jan 6, 2015 at 10:17 AM, gtinside gtins...@gmail.com wrote: Does spark
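
For reference, the setting mentioned above can be adjusted on the SparkConf; a sketch (the value chosen here is arbitrary, and in the 1.x configuration it is interpreted as milliseconds):

    import org.apache.spark.SparkConf

    // Wait longer for a data-local slot before the scheduler falls back to a non-local node.
    val conf = new SparkConf().set("spark.locality.wait", "10000")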

Re: Saving data to Hbase hung in Spark streaming application with Spark 1.2.0

2015-01-06 Thread Ted Yu
I assume hbase-site.xml is in the classpath. Can you try the code snippet in standalone program to see if the problem persists ? Cheers On Tue, Jan 6, 2015 at 6:42 AM, Max Xu max...@twosigma.com wrote: Hi all, I have a Spark streaming application that ingests data from a Kafka topic and

1.2.0 - java.lang.ClassCastException: scala.Tuple2 cannot be cast to scala.collection.Iterator

2015-01-06 Thread bchazalet
I am running into the same problem described here https://www.mail-archive.com/user%40spark.apache.org/msg17788.html which for some reason does not appear in the archives. I have a standalone Scala application, built (using sbt) with Spark jars from Maven: org.apache.spark %% spark-core

Re: Saving data to Hbase hung in Spark streaming application with Spark 1.2.0

2015-01-06 Thread Ted Yu
Default profile is hbase-hadoop1 so you need to specify -Dhbase.profile=hadoop2 See SPARK-1297 Cheers On Tue, Jan 6, 2015 at 9:11 AM, Max Xu max...@twosigma.com wrote: Thanks Ted. You are right, hbase-site.xml is in the classpath. But previously I have it in the classpath too and the app

Re: Saving data to Hbase hung in Spark streaming application with Spark 1.2.0

2015-01-06 Thread Ted Yu
I doubt anyone would deploy hbase 0.98.x on hadoop-1. Looks like the hadoop2 profile should be made the default. Cheers On Tue, Jan 6, 2015 at 9:49 AM, Max Xu max...@twosigma.com wrote: Awesome. Thanks again Ted. I remember there is a block in the pom.xml under the example folder that default

Data Locality

2015-01-06 Thread gtinside
Does spark guarantee to push the processing to the data ? Before creating tasks does spark always check for data location ? So for example if I have 3 spark nodes (Node1, Node2, Node3) and data is local to just 2 nodes (Node1 and Node2) , will spark always schedule tasks on the node for which the

Re: Shuffle Problems in 1.2.0

2015-01-06 Thread Sven Krasser
The issue has been sensitive to the number of executors and input data size. I'm using 2 executors with 4 cores each, 25GB of memory, 3800MB of memory overhead for YARN. This will fit onto Amazon r3 instance types. -Sven On Tue, Jan 6, 2015 at 12:46 AM, Davies Liu dav...@databricks.com wrote: I

Re: RDD Moving Average

2015-01-06 Thread Sean Owen
First you'd need to sort the RDD to give it a meaningful order, but I assume you have some kind of timestamp in your data you can sort on. I think you might be after the sliding() function, a developer API in MLlib:
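
A hedged sketch of the sliding() approach (a developer API in MLlib), assuming an RDD of prices already sorted by timestamp:

    import org.apache.spark.mllib.rdd.RDDFunctions._   // brings sliding() into scope
    import org.apache.spark.rdd.RDD

    // prices must already be sorted by time (e.g. via sortBy on the timestamp first).
    def movingAverage(prices: RDD[Double], n: Int): RDD[Double] =
      prices.sliding(n).map(window => window.sum / n)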

Re: Using graphx to calculate average distance of a big graph

2015-01-06 Thread Ankur Dave
[-dev] What size of graph are you hoping to run this on? For small graphs where materializing the all-pairs shortest path is an option, you could simply find the APSP using https://github.com/apache/spark/pull/3619 and then take the average distance (apsp.map(_._2.toDouble).mean). Ankur