Re: column expression in left outer join for DataFrame

2015-03-25 Thread S Krishna
Hi, Thanks for your response. I modified my code as per your suggestion, but now I am getting a runtime error. Here's my code: val df_1 = df.filter(df("event") === 0).select("country", "cnt") val df_2 = df.filter(df("event") === 3).select("country", "cnt")

Re: Spark Application Hung

2015-03-25 Thread Akhil Das
In production, I'd suggest having a high-availability cluster with a minimum of 3 nodes (data nodes in your case). Now let's examine your scenario: - When you suddenly bring down one of the nodes which has 2 executors running on it, what happens is that the node (DN2) will be having your jobs

Spark Performance -Hive or Hbase?

2015-03-25 Thread Siddharth Ubale
Hi, We have started R&D on Apache Spark to use its features such as Spark SQL and Spark Streaming. I have two pain points; can anyone address them? They are as follows: 1. Does Spark allow us to fetch updated items after an RDD has been mapped and schema has been

Re: spark worker on mesos slave | possible networking config issue

2015-03-25 Thread Akhil Das
It says: Tried to associate with unreachable remote address [akka.tcp://sparkDriver@localhost:51849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: localhost/127.0.0.1:51849 I'd suggest changing this property:

Re: Weird exception in Spark job

2015-03-25 Thread Akhil Das
As it says, you are having a jar conflict: java.lang.NoSuchMethodError: org.jboss.netty.channel.socket.nio.NioWorkerPool.init(Ljava/util/concurrent/Executor;I)V Verify your classpath and see the netty versions Thanks Best Regards On Tue, Mar 24, 2015 at 11:07 PM, nitinkak001

Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread sandeep vura
Hi Sparkers, I am trying to load data in Spark with the following command: sqlContext.sql("LOAD DATA LOCAL INPATH '/home/spark12/sandeep/sandeep.txt' INTO TABLE src"); Getting the exception below: Server IPC version 9 cannot communicate with client version 4. Note: I am using Hadoop 2.2

Re: Optimal solution for getting the header from CSV with Spark

2015-03-25 Thread Felix C
The spark-csv package can handle header row, and the code is at the link below. It could also use the header to infer field names in the schema. https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/CsvRelation.scala --- Original Message --- From: Dean
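
For reference, a minimal sketch of loading a CSV with a header row through spark-csv on Spark 1.3 (the file path is a placeholder and option names follow the package's documented usage; sc is an existing SparkContext):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // "header" -> "true" makes spark-csv treat the first row as field names
    val df = sqlContext.load(
      "com.databricks.spark.csv",
      Map("path" -> "data/people.csv", "header" -> "true"))
    df.printSchema()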

Re: FAILED SelectChannelConnector@0.0.0.0:4040 java.net.BindException: Address already in use

2015-03-25 Thread Akhil Das
It means you already have 4 applications running on 4040, 4041, 4042, and 4043, and that's why it was able to run on 4044. You can do a netstat -pnat | grep 404 and see what processes are running. Thanks Best Regards On Wed, Mar 25, 2015 at 1:13 AM, , Roy rp...@njit.edu wrote: I get

Re: Optimal solution for getting the header from CSV with Spark

2015-03-25 Thread Spico Florin
Hello! Thanks for your responses. I was afraid that due to partitioning I would lose the logic that the first element is the header. I observe that rdd.first internally calls the rdd.take(1) method. Best regards, Florin On Tue, Mar 24, 2015 at 4:41 PM, Dean Wampler deanwamp...@gmail.com wrote:

Explanation streaming-cep-engine with example

2015-03-25 Thread Dhimant
Hi, Can someone explain how spark streaming cep engine works ? How to use it with sample example? http://spark-packages.org/package/Stratio/streaming-cep-engine -- View this message in context:

Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread Saisai Shao
Looks like you have to build Spark against the matching Hadoop version, otherwise you will hit the exception mentioned. You could follow this doc: http://spark.apache.org/docs/latest/building-spark.html 2015-03-25 15:22 GMT+08:00 sandeep vura sandeepv...@gmail.com: Hi Sparkers, I am trying to load

Spark Maven Test error

2015-03-25 Thread zzcclp
I use command to run Unit test, as follow: ./make-distribution.sh --tgz --skip-java-test -Pscala-2.10 -Phadoop-2.3 -Phive -Phive-thriftserver -Pyarn -Dyarn.version=2.3.0-cdh5.1.2 -Dhadoop.version=2.3.0-cdh5.1.2 mvn -Pscala-2.10 -Phadoop-2.3 -Phive -Phive-thriftserver -Pyarn

Exception in thread main java.lang.VerifyError: class org.apache.hadoop.yarn.proto.YarnProtos$PriorityProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;

2015-03-25 Thread Canoe
I compile spark-1.3.0 on Hadoop 2.3.0-cdh5.1.0 with protoc 2.5.0. But when I try to run the examples, it throws: Exception in thread main java.lang.VerifyError: class org.apache.hadoop.yarn.proto.YarnProtos$PriorityProto overrides final method

Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread Akhil Das
Is your Spark compiled against Hadoop 2.2? If not, download the Spark 1.2 binary with Hadoop 2.2 from https://spark.apache.org/downloads.html Thanks Best Regards On Wed, Mar 25, 2015 at 12:52 PM, sandeep vura sandeepv...@gmail.com wrote: Hi Sparkers, I am trying to load data in spark with the

Serialization Problem in Spark Program

2015-03-25 Thread donhoff_h
Hi, experts. I wrote a very simple Spark program to test the Kryo serialization function. The code is as follows: object TestKryoSerialization { def main(args: Array[String]) { val conf = new SparkConf() conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
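
A self-contained sketch of that Kryo setup with the quotes restored (app name and the registration flag are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("TestKryoSerialization")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "false")  // flip to "true" to catch unregistered classes
    val sc = new SparkContext(conf)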

OutOfMemoryError when using DataFrame created by Spark SQL

2015-03-25 Thread SLiZn Liu
Hi, I am using *Spark SQL* to query on my *Hive cluster*, following Spark SQL and DataFrame Guide https://spark.apache.org/docs/latest/sql-programming-guide.html step by step. However, my HiveQL via sqlContext.sql() fails and java.lang.OutOfMemoryError was raised. The expected result of such

OutOfMemoryError when using DataFrame created by Spark SQL

2015-03-25 Thread Todd Leo
Hi, I am using *Spark SQL* to query on my *Hive cluster*, following Spark SQL and DataFrame Guide https://spark.apache.org/docs/latest/sql-programming-guide.html step by step. However, my HiveQL via sqlContext.sql() fails and java.lang.OutOfMemoryError was raised. The expected result of such

Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread sandeep vura
Where do I export MAVEN_OPTS, in spark-env.sh or hadoop-env.sh? I am running the below command in the spark/yarn directory where the pom.xml file is available: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package Please correct me if I am wrong. On Wed, Mar 25, 2015 at 12:55 PM,

Re: OutOfMemoryError when using DataFrame created by Spark SQL

2015-03-25 Thread Ted Yu
Can you try giving Spark driver more heap ? Cheers On Mar 25, 2015, at 2:14 AM, Todd Leo sliznmail...@gmail.com wrote: Hi, I am using Spark SQL to query on my Hive cluster, following Spark SQL and DataFrame Guide step by step. However, my HiveQL via sqlContext.sql() fails and

issue while submitting Spark Job as --master yarn-cluster

2015-03-25 Thread sachin Singh
Hi, when I am submitting a Spark job in cluster mode I am getting the error below in the hadoop-yarn log. Does anyone have any idea? Please suggest. 2015-03-25 23:35:22,467 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1427124496008_0028 State change from FINAL_SAVING to FAILED

Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread Akhil Das
Just run: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.2 -DskipTests clean package Thanks Best Regards On Wed, Mar 25, 2015 at 3:08 PM, sandeep vura sandeepv...@gmail.com wrote: Where do i export MAVEN_OPTS in spark-env.sh or hadoop-env.sh I am running the below command in spark/yarn

Spark-sql query got exception.Help

2015-03-25 Thread 李铖
It is OK when I query data from a small HDFS file. But if the HDFS file is 152 MB, I get this exception. I tried sc.setSystemProperty("spark.kryoserializer.buffer.mb", "256"), but the error persists. ``` com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 39135 at

How do you write Dataframes to elasticsearch

2015-03-25 Thread yamanoj
It seems that elasticsearch-spark_2.10 does not currently support Spark 1.3. Could you tell me if there is an alternative way to save DataFrames to Elasticsearch? -- View this message in context:

Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread Akhil Das
Oh, in that case you should mention 2.4. If you don't want to compile Spark, then you can download the precompiled version from the Downloads page https://spark.apache.org/downloads.html: http://d3kbcqa49mib13.cloudfront.net/spark-1.2.0-bin-hadoop2.4.tgz Thanks Best Regards On Wed, Mar 25, 2015 at

Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread Sean Owen
Of course, VERSION is supposed to be replaced by a real Hadoop version! On Wed, Mar 25, 2015 at 12:04 PM, sandeep vura sandeepv...@gmail.com wrote: Build failed with following errors. I have executed the below following command. mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests

Re: spark worker on mesos slave | possible networking config issue

2015-03-25 Thread Anirudha Jadhav
Is there a way to have this dynamically pick the local IP? Static assignment does not work because the workers are dynamically allocated on Mesos. On Wed, Mar 25, 2015 at 3:04 AM, Akhil Das ak...@sigmoidanalytics.com wrote: It says: Tried to associate with unreachable remote address

Re: spark worker on mesos slave | possible networking config issue

2015-03-25 Thread Akhil Das
Remove SPARK_LOCAL_IP then? Thanks Best Regards On Wed, Mar 25, 2015 at 6:45 PM, Anirudha Jadhav aniru...@nyu.edu wrote: is there a way to have this dynamically pick the local IP. static assignment does not work cos the workers are dynamically allocated on mesos On Wed, Mar 25, 2015 at

JavaKinesisWordCountASLYARN Example not working on EMR

2015-03-25 Thread ankur.jain
Hi, I am trying to run a Spark on YARN program provided by Spark in the examples directory, using Amazon Kinesis on an EMR cluster. I am using Spark 1.3.0 and EMR AMI 3.5.0. I've set up the credentials: export AWS_ACCESS_KEY_ID=XX export AWS_SECRET_KEY=XXX A) This is the Kinesis Word

Re: NetwrokWordCount + Spark standalone

2015-03-25 Thread Akhil Das
You can open the Master UI running on port 8080 of your Ubuntu machine and, after submitting the job, you can see how many cores are being used etc. from the UI. Thanks Best Regards On Wed, Mar 25, 2015 at 6:50 PM, James King jakwebin...@gmail.com wrote: Thanks Akhil, Yes indeed this is why it

Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread sandeep vura
Build failed with the following errors. I have executed the following command: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package [INFO] [INFO] BUILD FAILURE [INFO]

Re: JavaKinesisWordCountASLYARN Example not working on EMR

2015-03-25 Thread Arush Kharbanda
Did you build for Kinesis using the profile -Pkinesis-asl? On Wed, Mar 25, 2015 at 7:18 PM, ankur.jain ankur.j...@yash.com wrote: Hi, I am trying to run a Spark on YARN program provided by Spark in the examples directory using Amazon Kinesis on an EMR cluster: I am using Spark 1.3.0 and EMR AMI:

Re: FAILED SelectChannelConnector@0.0.0.0:4040 java.net.BindException: Address already in use

2015-03-25 Thread , Roy
Yes, I do have other applications already running. Thanks for your explanation. On Wed, Mar 25, 2015 at 2:49 AM, Akhil Das ak...@sigmoidanalytics.com wrote: It means you are already having 4 applications running on 4040, 4041, 4042, 4043. And that's why it was able to run on 4044. You can

Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread Akhil Das
-Dhadoop.version=2.2 Thanks Best Regards On Wed, Mar 25, 2015 at 5:34 PM, sandeep vura sandeepv...@gmail.com wrote: Build failed with the following errors. I have executed the following command: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package [INFO]

Re: 1.3 Hadoop File System problem

2015-03-25 Thread Jim Carroll
Thanks Patrick and Michael for your responses. For anyone else that runs across this problem prior to 1.3.1 being released, I've been pointed to this Jira ticket that's scheduled for 1.3.1: https://issues.apache.org/jira/browse/SPARK-6351 Thanks again. -- View this message in context:

Re: EC2 Having script run at startup

2015-03-25 Thread rahulkumar-aws
You can use the AWS user-data feature. Try this, if it helps: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html - Software Developer SigmoidAnalytics, Bangalore -- View this message in context:

NetwrokWordCount + Spark standalone

2015-03-25 Thread James King
I'm trying to run the Java NetworkWordCount example against a simple Spark standalone runtime of one master and one worker. But it doesn't seem to work; the text entered on the Netcat data server is not being picked up and printed to the Eclipse console output. However if I use

Re: Spark ML Pipeline inaccessible types

2015-03-25 Thread Peter Rudenko
Hi Martin, here are 2 possibilities to overcome this: 1) Put your logic into the org.apache.spark package in your project - then everything would be accessible. 2) Dirty trick: object SparkVector extends HashingTF { val VectorUDT: DataType = outputDataType } Then you can do it like this:

Re: issue while submitting Spark Job as --master yarn-cluster

2015-03-25 Thread Sandy Ryza
Hi Sachin, It appears that the application master is failing. To figure out what's wrong you need to get the logs for the application master. -Sandy On Wed, Mar 25, 2015 at 7:05 AM, Sachin Singh sachin.sha...@gmail.com wrote: OS I am using Linux, when I will run simply as master yarn, its

Re: Spark-sql query got exception.Help

2015-03-25 Thread Cheng Lian
Oh, just noticed that you were calling sc.setSystemProperty. Actually you need to set this property in SparkConf or in spark-defaults.conf. And there are two configurations related to Kryo buffer size: spark.kryoserializer.buffer.mb, which is the initial size, and
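
A sketch of setting both Kryo buffer sizes up front in SparkConf (the values are examples, and the exact property names changed in later Spark versions):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.mb", "256")      // initial buffer size
      .set("spark.kryoserializer.buffer.max.mb", "512")  // maximum buffer size
    val sc = new SparkContext(conf)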

foreachRDD execution

2015-03-25 Thread Luis Ángel Vicente Sánchez
I have a simple and probably dumb question about foreachRDD. We are using spark streaming + cassandra to compute concurrent users every 5min. Our batch size is 10secs and our block interval is 2.5secs. At the end of the world we are using foreachRDD to join the data in the RDD with existing data

Re: NetwrokWordCount + Spark standalone

2015-03-25 Thread Akhil Das
Spark Streaming requires you to have a minimum of 2 cores, 1 for receiving your data and the other for processing. So when you say local[2] it basically initializes 2 threads on your local machine, 1 for receiving data from the network and the other for your word count processing. Thanks Best Regards
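
A minimal sketch that shows why two cores (or threads) are needed locally (the host, port and batch interval are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // local[2]: one thread for the socket receiver, one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()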

Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread sandeep vura
I am using Hadoop 2.4; should I mention -Dhadoop.version=2.2? $ hadoop version Hadoop 2.4.1 Subversion http://svn.apache.org/repos/asf/hadoop/common -r 1604318 Compiled by jenkins on 2014-06-21T05:43Z Compiled with protoc 2.5.0 From source

Re: NetwrokWordCount + Spark standalone

2015-03-25 Thread James King
Thanks Akhil, Yes indeed this is why it works when using local[2], but I'm unclear why it doesn't work when using standalone daemons. Is there a way to check what cores are being seen when running against standalone daemons? I'm running the master and worker on the same Ubuntu host. The Driver

Re: Spark as a service

2015-03-25 Thread Irfan Ahmad
You're welcome. How did it go? *Irfan Ahmad* CTO | Co-Founder | *CloudPhysics* http://www.cloudphysics.com Best of VMworld Finalist Best Cloud Management Award NetworkWorld 10 Startups to Watch EMA Most Notable Vendor On Wed, Mar 25, 2015 at 7:53 AM, Ashish Mukherjee ashish.mukher...@gmail.com

Write Parquet File with spark-streaming with Spark 1.3

2015-03-25 Thread richiesgr
Hi, I've succeeded in writing a Kafka stream to a Parquet file in Spark 1.2 but I can't make it work with Spark 1.3. In streaming I can't use saveAsParquetFile() because I can't add data to an existing Parquet file. I know that it's possible to stream data directly into Parquet; could you help me by

Re: How do you write Dataframes to elasticsearch

2015-03-25 Thread Nick Pentreath
Spark 1.3 is not supported by elasticsearch-hadoop yet but will be very soon: https://github.com/elastic/elasticsearch-hadoop/issues/400 However, in the meantime you could use df.toRDD.saveToEs - though you may have to manipulate the Row object to extract fields; not sure if it will
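
A rough sketch of that workaround, assuming the elasticsearch-hadoop (elasticsearch-spark) artifact is on the classpath; the index/type name and the field positions pulled from each Row are illustrative:

    import org.elasticsearch.spark._  // adds saveToEs to RDDs

    // df is an existing DataFrame; turn each Row into a Map before indexing
    df.rdd
      .map(row => Map("country" -> row.getString(0), "cnt" -> row.getLong(1)))
      .saveToEs("myindex/mytype")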

Spark Streaming - Minimizing batch interval

2015-03-25 Thread RodrigoB
I've been given a feature requirement that means processing events at a latency lower than 0.25ms, meaning I would have to make sure that Spark Streaming gets new events from the messaging layer within that period of time. Has anyone achieved such numbers using a Spark cluster? Or would

upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
I have an EC2 cluster created using Spark version 1.2.1, and I have an SBT project. Now I want to upgrade to Spark 1.3 and use the new features. Below are the issues. Sorry for the long post. Appreciate your help. Thanks -Roni Question - Do I have to create a new cluster using Spark 1.3? Here is

Re: Spark Streaming - Minimizing batch interval

2015-03-25 Thread Sean Owen
I don't think it's feasible to set a batch interval of 0.25ms. Even at tens of ms the overhead of the framework is a large factor. Do you mean 0.25s = 250ms? Related thoughts, and I don't know if they apply to your case: If you mean, can you just read off the source that quickly? yes. Sometimes

What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Stuart Layton
I have a SparkSQL dataframe with a few billion rows that I need to quickly filter down to a few hundred thousand rows, using an operation like (syntax may not be correct) df = df[ df.filter(lambda x: x.key_col in approved_keys)] I was thinking about serializing the data using parquet and

Recovered state for updateStateByKey and incremental streams processing

2015-03-25 Thread Ravi Reddy
I want to use the restore from checkpoint to continue from last accumulated word counts and process new streams of data. This recovery process will keep accurate state of accumulated counters state (calculated by updateStateByKey) after failure/recovery or temp shutdown/upgrade to new code.

python : Out of memory: Kill process

2015-03-25 Thread Eduardo Cusa
Hi Guys, I'm running the following function with spark-submit and the OS is killing my process: def getRdd(self,date,provider): path='s3n://'+AWS_BUCKET+'/'+date+'/*.log.gz' log2 = self.sqlContext.jsonFile(path) log2.registerTempTable('log_test') log2.cache()

Re: OOM for HiveFromSpark example

2015-03-25 Thread ๏̯͡๏
I am facing same issue, posted a new thread. Please respond. On Wed, Jan 14, 2015 at 4:38 AM, Zhan Zhang zzh...@hortonworks.com wrote: Hi Folks, I am trying to run hive context in yarn-cluster mode, but met some error. Does anybody know what cause the issue. I use following cmd to build

Re: OOM for HiveFromSpark example

2015-03-25 Thread Zhan Zhang
I solved this by increasing the PermGen memory size in the driver: -XX:MaxPermSize=512m Thanks. Zhan Zhang On Mar 25, 2015, at 10:54 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I am facing the same issue, posted a new thread. Please respond. On Wed, Jan 14, 2015 at 4:38 AM,

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Dean Wampler
For the Spark SQL parts, 1.3 breaks backwards compatibility, because before 1.3, Spark SQL was considered experimental and API changes were allowed. So, H2O and ADA compatible with 1.2.x might not work with 1.3. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

Unable to Hive program from Spark Programming Guide (OutOfMemoryError)

2015-03-25 Thread ๏̯͡๏
http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables I modified the Hive query but ran into the same error. ( http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables) val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

Re: OutOfMemory : Java heap space error

2015-03-25 Thread ๏̯͡๏
I am facing same issue, posted a new thread. Please respond. On Wed, Jul 9, 2014 at 1:56 AM, Rahul Bhojwani rahulbhojwani2...@gmail.com wrote: Hi, My code was running properly but then it suddenly gave this error. Can you just put some light on it. ### 0 KB, free: 38.7

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Nick Pentreath
What version of Spark do the other dependencies rely on (Adam and H2O)? That could be it. Or try sbt clean compile. — Sent from Mailbox On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote: I have an EC2 cluster created using Spark version 1.2.1. And I have an SBT project.

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
Even if H2O and ADA are dependent on 1.2.1, it should be backward compatible, right? So using 1.3 should not break them. And the code is not using the classes from those libs. I tried sbt clean compile .. same error Thanks _R On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath

RE: Date and decimal datatype not working

2015-03-25 Thread BASAK, ANANDA
Thanks. This library is only available with Spark 1.3. I am using version 1.2.1. Before I upgrade to 1.3, I want to try what can be done in 1.2.1. So I am using the following: val MyDataset = sqlContext.sql("my select query") MyDataset.map(t =

Re: Total size of serialized results is bigger than spark.driver.maxResultSize

2015-03-25 Thread Denny Lee
As you noted, you can change the spark.driver.maxResultSize value in your Spark Configurations (https://spark.apache.org/docs/1.2.0/configuration.html). Please reference the Spark Properties section noting that you can modify these properties via the spark-defaults.conf or via SparkConf(). HTH!
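
A small sketch of raising that limit programmatically before the SparkContext is created (the value is an example; it can equally go into spark-defaults.conf):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("myApp")
      .set("spark.driver.maxResultSize", "2g")  // default is 1g
    val sc = new SparkContext(conf)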

Re: java.lang.OutOfMemoryError: unable to create new native thread

2015-03-25 Thread ๏̯͡๏
I have a YARN cluster where the max memory allowed is 16 GB. I set 12 GB for my driver; however I see an OutOfMemory error even for this program http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables. What do you suggest? On Wed, Mar 25, 2015 at 8:23 AM, Thomas Gerber

Re: column expression in left outer join for DataFrame

2015-03-25 Thread Michael Armbrust
Unfortunately you are now hitting a bug (that is fixed in master and will be released in 1.3.1, hopefully next week). However, even with that your query is still ambiguous and you will need to use aliases: val df_1 = df.filter(df("event") === 0).select("country", "cnt").as("a") val
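
A fuller sketch of the aliased join suggested above (df, the column names and the alias names come from the thread; the exact column syntax may vary slightly across 1.3.x releases):

    import org.apache.spark.sql.functions.col

    val df_1 = df.filter(df("event") === 0).select("country", "cnt").as("a")
    val df_2 = df.filter(df("event") === 3).select("country", "cnt").as("b")

    // qualify the join columns with the aliases so they are no longer ambiguous
    val joined = df_1.join(df_2, col("a.country") === col("b.country"), "left_outer")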

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Nick Pentreath
Ah I see now you are trying to use a spark 1.2 cluster - you will need to be running spark 1.3 on your EC2 cluster in order to run programs built against spark 1.3. You will need to terminate and restart your cluster with spark 1.3  — Sent from Mailbox On Wed, Mar 25, 2015 at 6:39 PM,

[no subject]

2015-03-25 Thread Himanish Kushary
Hi, I have an RDD of pairs of strings like below: (A,B) (B,C) (C,D) (A,D) (E,F) (B,F) I need to transform/filter this into an RDD of pairs that does not repeat a string once it has been used. So something like: (A,B) (C,D) (E,F) (B,C) is out because B has already been used in (A,B), (A,D)

Re: OutOfMemoryError when using DataFrame created by Spark SQL

2015-03-25 Thread Michael Armbrust
You should also try increasing the perm gen size: -XX:MaxPermSize=512m On Wed, Mar 25, 2015 at 2:37 AM, Ted Yu yuzhih...@gmail.com wrote: Can you try giving Spark driver more heap ? Cheers On Mar 25, 2015, at 2:14 AM, Todd Leo sliznmail...@gmail.com wrote: Hi, I am using *Spark SQL*
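
For reference, a sketch of the two suggestions as configuration keys (values are examples; in client mode these must be set at launch, e.g. in spark-defaults.conf or on the spark-submit command line, because the driver JVM is already running by the time SparkConf is evaluated):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.driver.memory", "4g")                             // more driver heap
      .set("spark.driver.extraJavaOptions", "-XX:MaxPermSize=512m") // larger perm gen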

Re:

2015-03-25 Thread Nathan Kronenfeld
What would it do with the following dataset? (A, B) (A, C) (B, D) On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary himan...@gmail.com wrote: Hi, I have a RDD of pairs of strings like below : (A,B) (B,C) (C,D) (A,D) (E,F) (B,F) I need to transform/filter this into a RDD of pairs

Re: What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Michael Armbrust
The only way to do it in Python currently is to use the string-based filter API (where you pass us an expression as a string, and we parse it using our SQL parser). from pyspark.sql import Row from pyspark.sql.functions import * df = sc.parallelize([Row(name=test)]).toDF() df.filter(name in

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
Thanks Dean and Nick. So, I removed ADAM and H2O from my SBT as I was not using them. I got the code to compile - only to fail while running with: SparkContext: Created broadcast 1 from textFile at kmerIntersetion.scala:21 Exception in thread main java.lang.NoClassDefFoundError:

Re:

2015-03-25 Thread Himanish Kushary
It will only give (A,B). I am generating the pairs from combinations of the strings A, B, C and D, so the pairs (ignoring order) would be (A,B),(A,C),(A,D),(B,C),(B,D),(C,D). On successful filtering using the original condition it will transform to (A,B) and (C,D). On Wed, Mar 25, 2015 at 3:00
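
A driver-side greedy sketch of that filtering, assuming the pair list is small enough to collect (the selection is sequential by nature, so it is not expressed as a distributed transformation here):

    val pairs = sc.parallelize(Seq(("A","B"), ("B","C"), ("C","D"), ("A","D"), ("E","F"), ("B","F")))

    // keep a pair only if neither string has appeared in an already-kept pair
    val kept = pairs.collect().foldLeft((Set.empty[String], List.empty[(String, String)])) {
      case ((used, acc), (x, y)) =>
        if (used.contains(x) || used.contains(y)) (used, acc)
        else (used + x + y, (x, y) :: acc)
    }._2.reverse
    // kept == List((A,B), (C,D), (E,F))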

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Dean Wampler
Weird. Are you running using SBT console? It should have the spark-core jar on the classpath. Similarly, spark-shell or spark-submit should work, but be sure you're using the same version of Spark when running as when compiling. Also, you might need to add spark-sql to your SBT dependencies, but

Re: Can LBFGS be used on streaming data?

2015-03-25 Thread EcoMotto Inc.
Hello Jeremy, Sorry for the delayed reply! First issue was resolved, I believe it was just production and consumption rate problem. Regarding the second question, I am streaming the data from the file and there are about 38k records. I am sending the streams in the same sequence as I am reading

Re: Can a DataFrame be saved to s3 directly using Parquet?

2015-03-25 Thread Michael Armbrust
Until then you can try sql("SET spark.sql.parquet.useDataSourceApi=false") On Wed, Mar 25, 2015 at 12:15 PM, Michael Armbrust mich...@databricks.com wrote: This will be fixed in Spark 1.3.1: https://issues.apache.org/jira/browse/SPARK-6351 and is fixed in master/branch-1.3 if you want to

Re: Spark shell never leaves ACCEPTED state in YARN CDH5

2015-03-25 Thread Marcelo Vanzin
That probably means there are not enough free resources in your cluster to run the AM for the Spark job. Check your RM's web UI to see the resources you have available. On Wed, Mar 25, 2015 at 12:08 PM, Khandeshi, Ami ami.khande...@fmr.com.invalid wrote: I am seeing the same behavior. I have

Re: What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Michael Armbrust
My example is a totally reasonable way to do it; it just requires constructing strings. In many cases you can also do it with column objects: df[df.name == test].collect() Out[15]: [Row(name=u'test')] You should check out:

Re: column expression in left outer join for DataFrame

2015-03-25 Thread S Krishna
Hi, Thanks for your response. I am not clear about why the query is ambiguous. val both = df_2.join(df_1, df_2("country") === df_1("country"), "left_outer") I thought df_2("country") === df_1("country") indicates that the country field in the 2 dataframes should match and df_2("country") is the equivalent of

Re: python : Out of memory: Kill process

2015-03-25 Thread Eduardo Cusa
Hi Davies, I'm running 1.1.0. Now I'm following this thread that recommends using the batchSize parameter = 1: http://apache-spark-user-list.1001560.n3.nabble.com/pySpark-memory-usage-td3022.html If this does not work I will install 1.2.1 or 1.3. Regards On Wed, Mar 25, 2015 at 3:39 PM, Davies

Spark shell never leaves ACCEPTED state in YARN CDH5

2015-03-25 Thread Khandeshi, Ami
I am seeing the same behavior. I have enough resources. How do I resolve it? Thanks, Ami

Re: Can a DataFrame be saved to s3 directly using Parquet?

2015-03-25 Thread Michael Armbrust
This will be fixed in Spark 1.3.1: https://issues.apache.org/jira/browse/SPARK-6351 and is fixed in master/branch-1.3 if you want to compile from source On Wed, Mar 25, 2015 at 11:59 AM, Stuart Layton stuart.lay...@gmail.com wrote: I'm trying to save a dataframe to s3 as a parquet file but I'm

writing DStream RDDs to the same file

2015-03-25 Thread Adrian Mocanu
Hi, Is there a way to write all RDDs in a DStream to the same file? I tried this and got an empty file. I think it's because the file is not closed, i.e. ESMinibatchFunctions.writer.close() executes before the stream is created. Here's my code: myStream.foreachRDD(rdd => { rdd.foreach(x => {
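
One rough sketch of appending every batch to a single driver-local file (only sensible for small batches since it collects to the driver; the path is a placeholder):

    import java.io.{FileWriter, PrintWriter}

    myStream.foreachRDD { rdd =>
      // open in append mode, write this batch, then close so the data is flushed
      val writer = new PrintWriter(new FileWriter("/tmp/all-batches.txt", true))
      try rdd.collect().foreach(x => writer.println(x))
      finally writer.close()
    }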

Re: python : Out of memory: Kill process

2015-03-25 Thread Davies Liu
What's the version of Spark you are running? There is a bug in the SQL Python API [1]; it's fixed in 1.2.1 and 1.3. [1] https://issues.apache.org/jira/browse/SPARK-6055 On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa eduardo.c...@usmediaconsulting.com wrote: Hi Guys, I running the following

Re: What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Stuart Layton
Thanks for the response, I was using IN as an example of the type of operation I need to do. Is there another way to do this that lines up more naturally with the way things are supposed to be done in SparkSQL? On Wed, Mar 25, 2015 at 2:29 PM, Michael Armbrust mich...@databricks.com wrote: The

Re: java.lang.OutOfMemoryError: unable to create new native thread

2015-03-25 Thread Matt Silvey
This is a different kind of error. Thomas' OOM error was specific to the kernel refusing to create another thread/process for his application. Matthew On Wed, Mar 25, 2015 at 10:51 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I have a YARN cluster where the max memory allowed is 16GB. I set

Can a DataFrame be saved to s3 directly using Parquet?

2015-03-25 Thread Stuart Layton
I'm trying to save a dataframe to s3 as a parquet file but I'm getting Wrong FS errors df.saveAsParquetFile(parquetFile) 15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(46645) called with curMem=82744, maxMem=278302556 15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5

Re: python : Out of memory: Kill process

2015-03-25 Thread Davies Liu
With batchSize = 1, I think it will become even worse. I'd suggest going with 1.3 and having a taste of the new DataFrame API. On Wed, Mar 25, 2015 at 11:49 AM, Eduardo Cusa eduardo.c...@usmediaconsulting.com wrote: Hi Davies, I running 1.1.0. Now I'm following this thread that recommend use

Re: column expression in left outer join for DataFrame

2015-03-25 Thread Michael Armbrust
That's a good question. In this particular example, it is really only internal implementation details that make it ambiguous. However, fixing this was a very large change so we have deferred it to Spark 1.4 and instead print a warning now when you construct trivially equal expressions. I can try

Re: newbie quesiton - spark with mesos

2015-03-25 Thread Dean Wampler
I think the problem is the use the loopback address: export SPARK_LOCAL_IP=127.0.0.1 In the stack trace from the slave, you see this: ... Reason: Connection refused: localhost/127.0.0.1:51849 akka.actor.ActorNotFound: Actor not found for:

Re: Date and decimal datatype not working

2015-03-25 Thread Dean Wampler
Recall that the input isn't actually read until you do something that forces evaluation, like calling saveAsTextFile. You didn't show the whole stack trace here, but it probably occurred while parsing an input line where one of your long fields is actually an empty string. Because this is such a
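
A small sketch of the kind of defensive parsing this suggests (the delimiter, field layout and default value are illustrative):

    import scala.util.Try

    // treat empty or malformed numeric fields as a default instead of throwing
    def toLongOrElse(s: String, default: Long = 0L): Long =
      Try(s.trim.toLong).getOrElse(default)

    val parsed = sc.textFile("input.txt").map { line =>
      val fields = line.split("\\|", -1)  // -1 keeps trailing empty fields
      (fields(0), toLongOrElse(fields(1)))
    }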

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Dean Wampler
Yes, that's the problem. The RDD class exists in both binary jar files, but the signatures probably don't match. The bottom line, as always for tools like this, is that you can't mix versions. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

Exception Failed to add a datanode. User may turn off this feature by setting dfs.client.block.write.replace-datanode-on-failure.policy in configuration

2015-03-25 Thread varvind
Hi, I am running Spark on Mesos and getting this error. Can anyone help me resolve this? Thanks. 15/03/25 21:05:00 ERROR scheduler.LiveListenerBus: Listener EventLoggingListener threw an exception java.lang.reflect.InvocationTargetException at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown

trouble with jdbc df in python

2015-03-25 Thread elliott cordo
If I run the following: db = sqlContext.load("jdbc", url="jdbc:postgresql://localhost/xx", dbtables="mstr.d_customer") I get the error: py4j.protocol.Py4JJavaError: An error occurred while calling o28.load. : java.io.FileNotFoundException: File file:/Users/elliottcordo/jdbc does not exist Seems to

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
My cluster is still on spark 1.2 and in SBT I am using 1.3. So probably it is compiling with 1.3 but running with 1.2 ? On Wed, Mar 25, 2015 at 12:34 PM, Dean Wampler deanwamp...@gmail.com wrote: Weird. Are you running using SBT console? It should have the spark-core jar on the classpath.

Re: Can LBFGS be used on streaming data?

2015-03-25 Thread DB Tsai
Hi Arunkumar, I think L-BFGS will not work since L-BFGS algorithm assumes that the objective function will be always the same (i.e., the data is the same) for entire optimization process to construct the approximated Hessian matrix. In the streaming case, the data will be changing, so it will

Re: trouble with jdbc df in python

2015-03-25 Thread Michael Armbrust
Try: db = sqlContext.load(source="jdbc", url="jdbc:postgresql://localhost/xx", dbtables="mstr.d_customer") On Wed, Mar 25, 2015 at 2:19 PM, elliott cordo elliottco...@gmail.com wrote: If I run the following: db = sqlContext.load("jdbc", url="jdbc:postgresql://localhost/xx", dbtables="mstr.d_customer")

Re: How to specify the port for AM Actor ...

2015-03-25 Thread Shixiong Zhu
There is no configuration for it now. Best Regards, Shixiong Zhu 2015-03-26 7:13 GMT+08:00 Manoj Samel manojsamelt...@gmail.com: There may be firewall rules limiting the ports between host running spark and the hadoop cluster. In that case, not all ports are allowed. Can it be a range of

Re: trouble with jdbc df in python

2015-03-25 Thread Michael Armbrust
Thanks for following up. I'll fix the docs. On Wed, Mar 25, 2015 at 4:04 PM, elliott cordo elliottco...@gmail.com wrote: Thanks!.. the below worked: db = sqlCtx.load(source=jdbc, url=jdbc:postgresql://localhost/x?user=xpassword=x,dbtable=mstr.d_customer) Note that

Re: filter expression in API document for DataFrame

2015-03-25 Thread Michael Armbrust
Yeah sorry, this is already fixed but we need to republish the docs. I'll add that both of the following do work: people.filter("age > 30") people.filter(people("age") > 30) On Tue, Mar 24, 2015 at 7:11 PM, SK skrishna...@gmail.com wrote: The following statement appears in the Scala API example at

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
Is there any way that I can install the new one and remove the previous version? I installed Spark 1.3 on my EC2 master and set the Spark home to the new one. But when I start the spark-shell I get: java.lang.UnsatisfiedLinkError: org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative()V

Re: Spark ML Pipeline inaccessible types

2015-03-25 Thread zapletal-martin
Thanks Peter, I ended up doing something similar. I however consider both the approaches you mentioned bad practices which is why I was looking for a solution directly supported by the current code. I can work with that now, but it does not seem to be the proper solution. Regards,

Cross-compatibility of YARN shuffle service

2015-03-25 Thread Matt Cheah
Hi everyone, I am considering moving from Spark-Standalone to YARN. The context is that there are multiple Spark applications that are using different versions of Spark that all want to use the same YARN cluster. My question is: if I use a single Spark YARN shuffle service jar on the Node
