Re: Serialization Problem in Spark Program

2015-03-25 Thread Akhil Das
Try registering your MyObject[] with Kryo. On 25 Mar 2015 13:17, "donhoff_h" <165612...@qq.com> wrote: > Hi, experts > > I wrote a very simple spark program to test the KryoSerialization > function. The codes are as following: > > object TestKryoSerialization { > def main(args: Array[String]) {

Re: How to troubleshoot server.TransportChannelHandler Exception

2015-03-25 Thread Akhil Das
What's your Spark version? Not quite sure, but you could be hitting this issue https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4516 On 26 Mar 2015 11:01, "Xi Shen" wrote: > Hi, > > My environment is Windows 64bit, Spark + YARN. I had a job that takes a > long time. It starts well

Re: OOM for HiveFromSpark example

2015-03-25 Thread Akhil Das
Try to give the complete path to the file kv1.txt. On 26 Mar 2015 11:48, "ÐΞ€ρ@Ҝ (๏̯͡๏)" wrote: > I am now seeing this error. > > > > > > 15/03/25 19:44:03 ERROR yarn.ApplicationMaster: User class threw > exception: FAILED: SemanticException Line 1:23 Invalid path > ''examples/src/main/resources/

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-25 Thread Nick Pentreath
From a quick look at this link - http://accumulo.apache.org/1.6/accumulo_user_manual.html#_mapreduce - it seems you need to call some static methods on AccumuloInputFormat in order to set the auth, table, and range settings. Try setting these config options first and then call newAPIHadoopRDD? On

Re: OOM for HiveFromSpark example

2015-03-25 Thread ๏̯͡๏
I am now seeing this error. 15/03/25 19:44:03 ERROR yarn.ApplicationMaster: User class threw exception: FAILED: SemanticException Line 1:23 Invalid path ''examples/src/main/resources/kv1.txt'': No files matching path file:/hadoop/10/scratch/local/usercache/dvasthimal/appcache/application_14267

Re: Which OutputCommitter to use for S3?

2015-03-25 Thread Pei-Lun Lee
I updated the PR for SPARK-6352 to be more like SPARK-3595. I added a new setting "spark.sql.parquet.output.committer.class" in hadoop configuration to allow custom implementation of ParquetOutputCommitter. Can someone take a look at the PR? On Mon, Mar 16, 2015 at 5:23 PM, Pei-Lun Lee wrote: >

How to troubleshoot server.TransportChannelHandler Exception

2015-03-25 Thread Xi Shen
Hi, My environment is Windows 64bit, Spark + YARN. I had a job that takes a long time. It starts well, but it ended with below exception: 15/03/25 12:39:09 WARN server.TransportChannelHandler: Exception in connection from headnode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.68.34:585

Re: How to deploy binary dependencies to workers?

2015-03-25 Thread Jörn Franke
You probably need to add the dll directory to the path (not classpath!) environment variable on all nodes. On 26 Mar 2015 06:23, "Xi Shen" wrote: > Not of course...all machines in HDInsight are Windows 64bit server. And I > have made sure all my DLLs are for 64bit machines. I have managed to

RE: JavaKinesisWordCountASLYARN Example not working on EMR

2015-03-25 Thread Ankur Jain
I had installed Spark via bootstrap on EMR. https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark However, when I run Spark without YARN (local), it works fine. Thanks Ankur From: Arush Kharbanda [mailto:ar...@sigmoidanalytics.com] Sent: Wednesday, March 25, 2015 7:3

Re: How to deploy binary dependencies to workers?

2015-03-25 Thread Xi Shen
Not of course...all machines in HDInsight are Windows 64bit servers. And I have made sure all my DLLs are for 64bit machines. I have managed to get those DLLs loaded on my local machine, which is also Windows 64bit. Xi Shen about.me/davidshen

SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-25 Thread Pei-Lun Lee
Hi, When I save a parquet file with SaveMode.Overwrite, it never generates _common_metadata, whether it overwrites an existing dir or not. Is this expected behavior? And what is the benefit of _common_metadata? Will reading perform better when it is present? Thanks, -- Pei-Lun

Re: [SparkSQL] How to calculate stddev on a DataFrame?

2015-03-25 Thread Denny Lee
Perhaps this email reference may be able to help from a DataFrame perspective: http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201503.mbox/%3CCALte62ztepahF=5hk9rcfbnyk4z43wkcq4fkdcbwmgf_3_o...@mail.gmail.com%3E On Wed, Mar 25, 2015 at 7:29 PM Haopu Wang wrote: > Hi, > > > > I ha

Re: [SparkSQL] How to calculate stddev on a DataFrame?

2015-03-25 Thread Corey Nolet
I would do sum of squares. This would allow you to keep an ongoing value as an associative operation (in an aggregator) and then calculate the variance & std deviation after the fact. On Wed, Mar 25, 2015 at 10:28 PM, Haopu Wang wrote: > Hi, > > I have a DataFrame object and I want to do types
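A minimal sketch of the sum-of-squares approach Corey describes, assuming a DataFrame df with a numeric (double) column named "x" (both names are hypothetical); it keeps count, sum and sum of squares in one associative pass and derives the variance and stddev afterwards:

    import org.apache.spark.sql.DataFrame

    def stddevOf(df: DataFrame, colName: String = "x"): Double = {
      // One pass: accumulate (count, sum, sum of squares) associatively.
      val (n, sum, sumSq) = df.select(colName).rdd
        .map(_.getDouble(0))                 // assumes the column holds doubles
        .aggregate((0L, 0.0, 0.0))(
          (acc, x) => (acc._1 + 1, acc._2 + x, acc._3 + x * x),
          (a, b)   => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
      val mean     = sum / n
      val variance = sumSq / n - mean * mean // population variance
      math.sqrt(variance)
    }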

Re: Spark-sql query got exception.Help

2015-03-25 Thread Saisai Shao
Would you mind running it again to see if this exception can be reproduced? Exceptions in MapOutputTracker seldom occur; maybe some other exception led to this error. Thanks Jerry 2015-03-26 10:55 GMT+08:00 李铖 : > One more exception. How to fix it? Anybody help me, please. > > >

Re: Spark-sql query got exception.Help

2015-03-25 Thread 李铖
One more exception. How to fix it? Anybody help me, please. org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:386) at org

The dreaded bradcast error Error: Failed to get broadcast_0_piece0 of broadcast_0

2015-03-25 Thread rkgurram
val transArray: RDD[Flow] (Flow is my custom class; it has the following methods: getFlowStart(), which returns a Double (start time), and getFlowEnd(), which returns a Double (end time))

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Dean Wampler
You could stop the running processes and run the same processes using the new version, starting with the master and then the slaves. You would have to snoop around a bit to get the command-line arguments right, but it's doable. Use `ps -efw` to find the command-lines used. Be sure to rerun them

Re: Spark-sql query got exception.Help

2015-03-25 Thread 李铖
Yes, it works after I append the two properties in spark-defaults.conf. As I use Python programming on the Spark platform, the Python API does not have the SparkConf API. Thanks. 2015-03-25 21:07 GMT+08:00 Cheng Lian : > Oh, just noticed that you were calling sc.setSystemProperty. Actually > you need t

Re: OOM for HiveFromSpark example

2015-03-25 Thread Zhan Zhang
You can do it in $SPARK_HOME/conf/spark-defaults.conf: spark.driver.extraJavaOptions -XX:MaxPermSize=512m Thanks. Zhan Zhang On Mar 25, 2015, at 7:25 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) mailto:deepuj...@gmail.com>> wrote: Where and how do I pass this or other JVM arguments? -XX:MaxPermSize=512m On Wed, Mar 25,
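For reference, a sketch of where this goes (the 512m value is from Zhan's reply; the spark-submit flag shown is simply an alternative way to pass the same option):

    # $SPARK_HOME/conf/spark-defaults.conf
    spark.driver.extraJavaOptions  -XX:MaxPermSize=512m

    # or at submit time
    spark-submit --driver-java-options "-XX:MaxPermSize=512m" ...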

[SparkSQL] How to calculate stddev on a DataFrame?

2015-03-25 Thread Haopu Wang
Hi, I have a DataFrame object and I want to do different types of aggregations like count, sum, variance, stddev, etc. DataFrame has a DSL to do simple aggregations like count and sum. How about variance and stddev? Thank you for any suggestions!

Re: OOM for HiveFromSpark example

2015-03-25 Thread ๏̯͡๏
Where and how do I pass this or other JVM arguments? -XX:MaxPermSize=512m On Wed, Mar 25, 2015 at 11:36 PM, Zhan Zhang wrote: > I solved this by increasing the PermGen memory size in the driver. > > -XX:MaxPermSize=512m > > Thanks. > > Zhan Zhang > > On Mar 25, 2015, at 10:54 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) w

Re: Unable to Hive program from Spark Programming Guide (OutOfMemoryError)

2015-03-25 Thread ๏̯͡๏
Can someone please respond to this ? On Wed, Mar 25, 2015 at 11:18 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables > > > > I modified the Hive query but run into same error. ( > http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hiv

Re: Spark shell never leaves ACCEPTED state in YARN CDH5

2015-03-25 Thread Dean Chen
We had a similar problem. Turned out that the Spark driver was binding to the external IP of the CLI node Spark shell was running on, causing executors to fail to connect to the driver. The solution was to override "export SPARK_LOCAL_IP=" in spark-env.sh to the internal IP of the CLI node. -- D

Re: Spark shell never leaves ACCEPTED state in YARN CDH5

2015-03-25 Thread Tobias Pfeiffer
Hi, On Thu, Mar 26, 2015 at 4:08 AM, Khandeshi, Ami < ami.khande...@fmr.com.invalid> wrote: > I am seeing the same behavior. I have enough resources….. > CPU *and* memory are sufficient? No previous (unfinished) jobs eating them? Tobias

Re: Serialization Problem in Spark Program

2015-03-25 Thread Imran Rashid
You also need to register *array*s of MyObject. So change: conf.registerKryoClasses(Array(classOf[MyObject])) to conf.registerKryoClasses(Array(classOf[MyObject], classOf[Array[MyObject]])) On Wed, Mar 25, 2015 at 2:44 AM, donhoff_h <165612...@qq.com> wrote: > Hi, experts > > I wrote a very
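A minimal, self-contained sketch of Imran's fix (MyObject here is a hypothetical stand-in for the original poster's class):

    import org.apache.spark.{SparkConf, SparkContext}

    case class MyObject(id: Int, name: String)   // stand-in for the poster's class

    object TestKryoSerialization {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("TestKryoSerialization")
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          // Register the class and its array type; arrays of the element type
          // are what get shipped when partitions are serialized.
          .registerKryoClasses(Array(classOf[MyObject], classOf[Array[MyObject]]))
        val sc = new SparkContext(conf)
        val rdd = sc.parallelize(Seq(MyObject(1, "a"), MyObject(2, "b")))
        println(rdd.collect().mkString(", "))
        sc.stop()
      }
    }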

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-25 Thread David Holiday
Hi Irfan, thanks for getting back to me - I'll try the Accumulo list to be sure. What is the normal use case for Spark though? I'm surprised that hooking it into something as common and popular as Accumulo isn't more of an everyday task. DAVID HOLIDAY Software Engineer 760 607 3300 | Office 31

Re: LogisticRegressionWithLBFGS shows ERRORs

2015-03-25 Thread DB Tsai
We fixed a couple of issues in the Breeze LBFGS implementation. Can you try Spark 1.3 and see if they still exist? Thanks. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Mon, Mar 16, 2015 at 12:48 PM, Chang-Jia Wang wrote: > I just used random

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-25 Thread Irfan Ahmad
Hmmm this seems very accumulo-specific, doesn't it? Not sure how to help with that. *Irfan Ahmad* CTO | Co-Founder | *CloudPhysics* Best of VMworld Finalist Best Cloud Management Award NetworkWorld 10 Startups to Watch EMA Most Notable Vendor On Tue, Mar 24, 201

Re: How to deploy binary dependencies to workers?

2015-03-25 Thread DB Tsai
Are you deploying the windows dll to linux machine? Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Wed, Mar 25, 2015 at 3:57 AM, Xi Shen wrote: > I think you meant to use the "--files" to deploy the DLLs. I gave a try, but > it did no

Re: filter expression in API document for DataFrame

2015-03-25 Thread Michael Armbrust
Yeah sorry, this is already fixed but we need to republish the docs. I'll add that both of the following do work: people.filter("age > 30") people.filter(people("age") > 30) On Tue, Mar 24, 2015 at 7:11 PM, SK wrote: > > > The following statement appears in the Scala API example at > > https://sp
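A small runnable sketch of both forms from Michael's reply (the people/age data here is made up, and an existing SparkContext sc is assumed):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val people = sc.parallelize(Seq(("Alice", 25), ("Bob", 35))).toDF("name", "age")

    people.filter("age > 30").show()          // string expression, parsed by the SQL parser
    people.filter(people("age") > 30).show()  // Column-based expression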

Re: trouble with jdbc df in python

2015-03-25 Thread Michael Armbrust
Thanks for following up. I'll fix the docs. On Wed, Mar 25, 2015 at 4:04 PM, elliott cordo wrote: > Thanks!.. the below worked: > > db = sqlCtx.load(source="jdbc", > url="jdbc:postgresql://localhost/x?user=x&password=x",dbtable="mstr.d_customer") > > Note that > https://spark.apache.org/docs/la

Re: How to specify the port for AM Actor ...

2015-03-25 Thread Shixiong Zhu
There is no configuration for it now. Best Regards, Shixiong Zhu 2015-03-26 7:13 GMT+08:00 Manoj Samel : > There may be firewall rules limiting the ports between host running spark > and the hadoop cluster. In that case, not all ports are allowed. > > Can it be a range of ports that can be speci

Re: How to specify the port for AM Actor ...

2015-03-25 Thread Manoj Samel
There may be firewall rules limiting the ports between host running spark and the hadoop cluster. In that case, not all ports are allowed. Can it be a range of ports that can be specified ? On Wed, Mar 25, 2015 at 4:06 PM, Shixiong Zhu wrote: > It's a random port to avoid port conflicts, since

Re: How to specify the port for AM Actor ...

2015-03-25 Thread Shixiong Zhu
It's a random port to avoid port conflicts, since multiple AMs can run in the same machine. Why do you need a fixed port? Best Regards, Shixiong Zhu 2015-03-26 6:49 GMT+08:00 Manoj Samel : > Spark 1.3, Hadoop 2.5, Kerbeors > > When running spark-shell in yarn client mode, it shows following mess

Re: trouble with jdbc df in python

2015-03-25 Thread elliott cordo
Thanks!.. the below worked: db = sqlCtx.load(source="jdbc", url="jdbc:postgresql://localhost/x?user=x&password=x",dbtable="mstr.d_customer") Note that https://spark.apache.org/docs/latest/sql-programming-guide.html#dataframe-operations needs to be updated: [image: Inline image 1] On Wed, Mar 25

How to specify the port for AM Actor ...

2015-03-25 Thread Manoj Samel
Spark 1.3, Hadoop 2.5, Kerberos. When running spark-shell in yarn client mode, it shows the following message with a random port every time (44071 in the example below). Is there a way to pin that port to a specific port? It does not seem to be part of the ports specified in http://spark.apache.org/docs/l

Cross-compatibility of YARN shuffle service

2015-03-25 Thread Matt Cheah
Hi everyone, I am considering moving from Spark-Standalone to YARN. The context is that there are multiple Spark applications that are using different versions of Spark that all want to use the same YARN cluster. My question is: if I use a single Spark YARN shuffle service jar on the Node Manager

Re: Spark ML Pipeline inaccessible types

2015-03-25 Thread zapletal-martin
Thanks Peter, I ended up doing something similar. However, I consider both of the approaches you mentioned bad practice, which is why I was looking for a solution directly supported by the current code. I can work with that now, but it does not seem to be the proper solution. Regards, Mar

Re: trouble with jdbc df in python

2015-03-25 Thread Michael Armbrust
Try: db = sqlContext.load(source="jdbc", url="jdbc:postgresql://localhost/xx", dbtables="mstr.d_customer") On Wed, Mar 25, 2015 at 2:19 PM, elliott cordo wrote: > if i run the following: > > db = sqlContext.load("jdbc", url="jdbc:postgresql://localhost/xx", > dbtables="mstr.d_customer") > > i

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
Is there any way that I can install the new one and remove the previous version? I installed spark 1.3 on my EC2 master and set the spark home to the new one. But when I start the spark-shell I get - java.lang.UnsatisfiedLinkError: org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative()V

Re: Can LBFGS be used on streaming data?

2015-03-25 Thread DB Tsai
Hi Arunkumar, I think L-BFGS will not work since L-BFGS algorithm assumes that the objective function will be always the same (i.e., the data is the same) for entire optimization process to construct the approximated Hessian matrix. In the streaming case, the data will be changing, so it will caus

Exception Failed to add a datanode. User may turn off this feature by setting dfs.client.block.write.replace-datanode-on-failure.policy in configuration

2015-03-25 Thread varvind
Hi, I am running spark in mesos and getting this error. Can anyone help me resolve this? Thanks. 15/03/25 21:05:00 ERROR scheduler.LiveListenerBus: Listener EventLoggingListener threw an exception java.lang.reflect.InvocationTargetException at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Sour

trouble with jdbc df in python

2015-03-25 Thread elliott cordo
If I run the following: db = sqlContext.load("jdbc", url="jdbc:postgresql://localhost/xx", dbtables="mstr.d_customer") I get the error: py4j.protocol.Py4JJavaError: An error occurred while calling o28.load. : java.io.FileNotFoundException: File file:/Users/elliottcordo/jdbc does not exist Seem

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Dean Wampler
Yes, that's the problem. The RDD class exists in both binary jar files, but the signatures probably don't match. The bottom line, as always for tools like this, is that you can't mix versions. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

Re: Date and decimal datatype not working

2015-03-25 Thread Dean Wampler
Recall that the input isn't actually read until you do something that forces evaluation, like calling saveAsTextFile. You didn't show the whole stack trace here, but it probably occurred while parsing an input line where one of your long fields is actually an empty string. Because this is such a commo
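A defensive-parsing sketch along the lines Dean describes (the file path, delimiter and field positions are hypothetical, and an existing SparkContext sc is assumed):

    import scala.util.Try

    def parseLong(s: String): Option[Long] = Try(s.trim.toLong).toOption

    val rows = sc.textFile("input.txt")
      .map(_.split("\\|", -1))              // keep trailing empty fields
      .flatMap { fields =>
        // Skip lines whose numeric field is empty or malformed instead of
        // throwing NumberFormatException when the job is evaluated.
        parseLong(fields(0)).map(id => (id, fields.drop(1).mkString("|")))
      }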

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
My cluster is still on spark 1.2 and in SBT I am using 1.3. So probably it is compiling with 1.3 but running with 1.2 ? On Wed, Mar 25, 2015 at 12:34 PM, Dean Wampler wrote: > Weird. Are you running using SBT console? It should have the spark-core > jar on the classpath. Similarly, spark-shell o

writing DStream RDDs to the same file

2015-03-25 Thread Adrian Mocanu
Hi, Is there a way to write all RDDs in a DStream to the same file? I tried this and got an empty file. I think it's because the file is not closed, i.e. ESMinibatchFunctions.writer.close() executes before the stream is created. Here's my code: myStream.foreachRDD(rdd => { rdd.foreach(x => {
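One possible way around the ordering problem Adrian describes, as a sketch only: do all file handling inside foreachRDD on the driver. This collects each batch, so it assumes small batches; the output path is hypothetical and myStream is the poster's stream.

    myStream.foreachRDD { rdd =>
      val batch = rdd.collect()                             // small batches assumed
      val writer = new java.io.FileWriter("out.txt", true)  // open in append mode per batch
      try batch.foreach(x => writer.write(x.toString + "\n"))
      finally writer.close()
    }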

Re: newbie quesiton - spark with mesos

2015-03-25 Thread Dean Wampler
I think the problem is the use of the loopback address: export SPARK_LOCAL_IP=127.0.0.1 In the stack trace from the slave, you see this: ... Reason: Connection refused: localhost/127.0.0.1:51849 akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@localhost:5
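In practice that means pointing SPARK_LOCAL_IP at an address the slaves can actually reach, for example (placeholder IP), or removing the variable so Spark picks the host address itself, as Akhil suggests elsewhere in this digest:

    # conf/spark-env.sh on the driver host
    export SPARK_LOCAL_IP=10.0.0.12   # a routable IP of the driver host, not 127.0.0.1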

Re: Can LBFGS be used on streaming data?

2015-03-25 Thread EcoMotto Inc.
Hello Jeremy, Sorry for the delayed reply! The first issue was resolved; I believe it was just a production and consumption rate problem. Regarding the second question, I am streaming the data from the file and there are about 38k records. I am sending the streams in the same sequence as I am reading

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Dean Wampler
Weird. Are you running using SBT console? It should have the spark-core jar on the classpath. Similarly, spark-shell or spark-submit should work, but be sure you're using the same version of Spark when running as when compiling. Also, you might need to add spark-sql to your SBT dependencies, but th

Re: What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Michael Armbrust
My example is a totally reasonable way to do it; it just requires constructing strings. In many cases you can also do it with column objects: df[df.name == "test"].collect() Out[15]: [Row(name=u'test')] You should check out: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark

Re: Spark shell never leaves ACCEPTED state in YARN CDH5

2015-03-25 Thread Marcelo Vanzin
That probably means there are not enough free resources in your cluster to run the AM for the Spark job. Check your RM's web UI to see the resources you have available. On Wed, Mar 25, 2015 at 12:08 PM, Khandeshi, Ami wrote: > I am seeing the same behavior. I have enough resources….. How do I re

Re: Can a DataFrame be saved to s3 directly using Parquet?

2015-03-25 Thread Michael Armbrust
Until then you can try sql("SET spark.sql.parquet.useDataSourceApi=false") On Wed, Mar 25, 2015 at 12:15 PM, Michael Armbrust wrote: > This will be fixed in Spark 1.3.1: > https://issues.apache.org/jira/browse/SPARK-6351 > > and is fixed in master/branch-1.3 if you want to compile from source >

Re: Can a DataFrame be saved to s3 directly using Parquet?

2015-03-25 Thread Michael Armbrust
This will be fixed in Spark 1.3.1: https://issues.apache.org/jira/browse/SPARK-6351 and is fixed in master/branch-1.3 if you want to compile from source On Wed, Mar 25, 2015 at 11:59 AM, Stuart Layton wrote: > I'm trying to save a dataframe to s3 as a parquet file but I'm getting > Wrong FS err

Re:

2015-03-25 Thread Himanish Kushary
It will only give (A,B). I am generating the pairs from combinations of the strings A, B, C and D, so the pairs (ignoring order) would be (A,B),(A,C),(A,D),(B,C),(B,D),(C,D). On successful filtering using the original condition it will transform to (A,B) and (C,D) On Wed, Mar 25, 2015 at 3:00 PM

Spark shell never leaves ACCEPTED state in YARN CDH5

2015-03-25 Thread Khandeshi, Ami
I am seeing the same behavior. I have enough resources. How do I resolve it? Thanks, Ami

Re: column expression in left outer join for DataFrame

2015-03-25 Thread Michael Armbrust
That's a good question. In this particular example, it is really only internal implementation details that make it ambiguous. However, fixing this was a very large change so we have deferred it to Spark 1.4 and instead print a warning now when you construct trivially equal expressions. I can try t

Re:

2015-03-25 Thread Nathan Kronenfeld
What would it do with the following dataset? (A, B) (A, C) (B, D) On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary wrote: > Hi, > > I have a RDD of pairs of strings like below : > > (A,B) > (B,C) > (C,D) > (A,D) > (E,F) > (B,F) > > I need to transform/filter this into a RDD of pairs that does

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
Thanks Dean and Nick. So, I removed the ADAM and H2O from my SBT as I was not using them. I got the code to compile - only to fail while running with - SparkContext: Created broadcast 1 from textFile at kmerIntersetion.scala:21 Exception in thread "main" java.lang.NoClassDefFoundError: org/apache

Can a DataFrame be saved to s3 directly using Parquet?

2015-03-25 Thread Stuart Layton
I'm trying to save a dataframe to s3 as a parquet file but I'm getting Wrong FS errors >>> df.saveAsParquetFile(parquetFile) 15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(46645) called with curMem=82744, maxMem=278302556 15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5 s

Re: python : Out of memory: Kill process

2015-03-25 Thread Davies Liu
With batchSize = 1, I think it will become even worse. I'd suggest going with 1.3 and getting a taste of the new DataFrame API. On Wed, Mar 25, 2015 at 11:49 AM, Eduardo Cusa wrote: > Hi Davies, I running 1.1.0. > > Now I'm following this thread that recommend use batchsize parameter = 1 > > > http:/

Re: column expression in left outer join for DataFrame

2015-03-25 Thread S Krishna
Hi, Thanks for your response. I am not clear about why the query is ambiguous. val both = df_2.join(df_1, df_2("country")===df_1("country"), "left_outer") I thought df_2("country")===df_1("country") indicates that the country field in the 2 dataframes should match and df_2("country") is the equ

Re: python : Out of memory: Kill process

2015-03-25 Thread Eduardo Cusa
Hi Davies, I am running 1.1.0. Now I'm following this thread, which recommends using the batchsize parameter = 1 http://apache-spark-user-list.1001560.n3.nabble.com/pySpark-memory-usage-td3022.html If this does not work I will install 1.2.1 or 1.3. Regards On Wed, Mar 25, 2015 at 3:39 PM, Davies Li

Re: java.lang.OutOfMemoryError: unable to create new native thread

2015-03-25 Thread Matt Silvey
This is a different kind of error. Thomas' OOM error was specific to the kernel refusing to create another thread/process for his application. Matthew On Wed, Mar 25, 2015 at 10:51 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > I have a YARN cluster where the max memory allowed is 16GB. I set 12G for > my driver,

Re: What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Stuart Layton
Thanks for the response, I was using IN as an example of the type of operation I need to do. Is there another way to do this that lines up more naturally with the way things are supposed to be done in SparkSQL? On Wed, Mar 25, 2015 at 2:29 PM, Michael Armbrust wrote: > The only way to do "in" us

Re: python : Out of memory: Kill process

2015-03-25 Thread Davies Liu
What's the version of Spark you are running? There is a bug in the SQL Python API [1]; it's fixed in 1.2.1 and 1.3. [1] https://issues.apache.org/jira/browse/SPARK-6055 On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa wrote: > Hi Guys, I running the following function with spark-submmit and de SO is

Re: What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Michael Armbrust
The only way to do "in" using python currently is to use the string based filter API (where you pass us an expression as a string, and we parse it using our SQL parser). from pyspark.sql import Row from pyspark.sql.functions import * df = sc.parallelize([Row(name="test")]).toDF() df.filter("name

Re: OutOfMemoryError when using DataFrame created by Spark SQL

2015-03-25 Thread Michael Armbrust
You should also try increasing the perm gen size: -XX:MaxPermSize=512m On Wed, Mar 25, 2015 at 2:37 AM, Ted Yu wrote: > Can you try giving Spark driver more heap ? > > Cheers > > > > On Mar 25, 2015, at 2:14 AM, Todd Leo wrote: > > Hi, > > I am using *Spark SQL* to query on my *Hive cluster*, f

Re: OOM for HiveFromSpark example

2015-03-25 Thread Zhan Zhang
I solved this by increasing the PermGen memory size in the driver. -XX:MaxPermSize=512m Thanks. Zhan Zhang On Mar 25, 2015, at 10:54 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) mailto:deepuj...@gmail.com>> wrote: I am facing the same issue, posted a new thread. Please respond. On Wed, Jan 14, 2015 at 4:38 AM, Zhan Zhang mailt

Re: column expression in left outer join for DataFrame

2015-03-25 Thread Michael Armbrust
Unfortunately you are now hitting a bug (that is fixed in master and will be released in 1.3.1, hopefully next week). However, even with that fix your query is still ambiguous and you will need to use aliases: val df_1 = df.filter( df("event") === 0) . select("country", "cnt").as("a"
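A sketch completing that aliasing idea (df, the event filter and the country/cnt columns come from the thread; the second filter value and the join itself are illustrative):

    import org.apache.spark.sql.functions.col

    val a = df.filter(df("event") === 0).select("country", "cnt").as("a")
    val b = df.filter(df("event") === 3).select("country", "cnt").as("b")

    // Qualify columns by alias so the join condition is no longer ambiguous.
    val both = a.join(b, col("a.country") === col("b.country"), "left_outer")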

Re: OutOfMemory : Java heap space error

2015-03-25 Thread ๏̯͡๏
I am facing the same issue and posted a new thread. Please respond. On Wed, Jul 9, 2014 at 1:56 AM, Rahul Bhojwani wrote: > Hi, > > My code was running properly but then it suddenly gave this error. Can you > just put some light on it. > > ### > 0 KB, free: 38.7 MB) > 14/07/09 01:46

Re: OOM for HiveFromSpark example

2015-03-25 Thread ๏̯͡๏
I am facing the same issue and posted a new thread. Please respond. On Wed, Jan 14, 2015 at 4:38 AM, Zhan Zhang wrote: > Hi Folks, > > I am trying to run hive context in yarn-cluster mode, but met some error. > Does anybody know what cause the issue. > > I use following cmd to build the distribution: >

Re: java.lang.OutOfMemoryError: unable to create new native thread

2015-03-25 Thread ๏̯͡๏
I have a YARN cluster where the max memory allowed is 16GB. I set 12G for my driver, however I see an OutOfMemory error even for this program http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables . What do you suggest? On Wed, Mar 25, 2015 at 8:23 AM, Thomas Gerber wrote: > So,

Unable to Hive program from Spark Programming Guide (OutOfMemoryError)

2015-03-25 Thread ๏̯͡๏
http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables I modified the Hive query but run into same error. ( http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables) val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("CREAT

Re: Total size of serialized results is bigger than spark.driver.maxResultSize

2015-03-25 Thread Denny Lee
As you noted, you can change the spark.driver.maxResultSize value in your Spark Configurations (https://spark.apache.org/docs/1.2.0/configuration.html). Please reference the Spark Properties section noting that you can modify these properties via the spark-defaults.conf or via SparkConf(). HTH!
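For example (the 2g value is only an illustration):

    # conf/spark-defaults.conf
    spark.driver.maxResultSize  2g

    // or programmatically, before creating the SparkContext
    import org.apache.spark.SparkConf
    val conf = new SparkConf().setAppName("app").set("spark.driver.maxResultSize", "2g")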

python : Out of memory: Kill process

2015-03-25 Thread Eduardo Cusa
Hi guys, I am running the following function with spark-submit and the OS is killing my process: def getRdd(self,date,provider): path='s3n://'+AWS_BUCKET+'/'+date+'/*.log.gz' log2= self.sqlContext.jsonFile(path) log2.registerTempTable('log_test') log2.cache() out=self.sqlConte

Recovered state for updateStateByKey and incremental streams processing

2015-03-25 Thread Ravi Reddy
I want to use the "restore from checkpoint" to continue from last accumulated word counts and process new streams of data. This recovery process will keep accurate state of accumulated counters state (calculated by updateStateByKey) after "failure/recovery" or "temp shutdown/upgrade to new code".

[no subject]

2015-03-25 Thread Himanish Kushary
Hi, I have an RDD of pairs of strings like below : (A,B) (B,C) (C,D) (A,D) (E,F) (B,F) I need to transform/filter this into an RDD of pairs that does not repeat a string once it has been used once. So something like: (A,B) (C,D) (E,F). (B,C) is out because B has already been used in (A,B), (A,D)
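One greedy, order-dependent reading of that requirement, sketched driver-side (it assumes the pair set fits in driver memory and that an existing SparkContext sc is available):

    val pairs = sc.parallelize(Seq(
      ("A","B"), ("B","C"), ("C","D"), ("A","D"), ("E","F"), ("B","F")))

    val seen = scala.collection.mutable.Set[String]()
    val kept = pairs.collect().filter { case (x, y) =>
      if (seen(x) || seen(y)) false
      else { seen += x; seen += y; true }   // keep the pair, mark both strings as used
    }
    // kept: Array((A,B), (C,D), (E,F)), matching the expected output above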

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Dean Wampler
For the Spark SQL parts, 1.3 breaks backwards compatibility, because before 1.3, Spark SQL was considered experimental where API changes were allowed. So, H2O and ADA compatible with 1.2.X might not work with 1.3. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

RE: Date and decimal datatype not working

2015-03-25 Thread BASAK, ANANDA
Thanks. This library is only available with Spark 1.3. I am using version 1.2.1. Before I upgrade to 1.3, I want to try what can be done in 1.2.1. So I am using the following: val MyDataset = sqlContext.sql("my select query") MyDataset.map(t => t(0)+"|"+t(1)+"|"+t(2)+"|"+t(3)+"|"+t(4)+"|"+t(5)).sav
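A slightly more compact 1.2.1-era form of the same workaround (a sketch; the query and output path are placeholders):

    val myDataset = sqlContext.sql("my select query")
    myDataset
      .map(row => row.mkString("|"))     // Row behaves like a sequence of fields here
      .saveAsTextFile("/tmp/my_output")  // hypothetical output path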

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Nick Pentreath
Ah I see now you are trying to use a spark 1.2 cluster - you will need to be running spark 1.3 on your EC2 cluster in order to run programs built against spark 1.3. You will need to terminate and restart your cluster with spark 1.3  — Sent from Mailbox On Wed, Mar 25, 2015 at 6:39 PM, ron

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
Even if H2O and ADA depend on 1.2.1, it should be backward compatible, right? So using 1.3 should not break them. And the code is not using the classes from those libs. I tried sbt clean compile... same error. Thanks _R On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath wrote: > What versio

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Nick Pentreath
What version of Spark do the other dependencies rely on (Adam and H2O?) - that could be it Or try sbt clean compile  — Sent from Mailbox On Wed, Mar 25, 2015 at 5:58 PM, roni wrote: > I have a EC2 cluster created using spark version 1.2.1. > And I have a SBT project . > Now I want to upg

upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
I have an EC2 cluster created using spark version 1.2.1, and I have an SBT project. Now I want to upgrade to spark 1.3 and use the new features. Below are the issues. Sorry for the long post. Appreciate your help. Thanks -Roni Question - Do I have to create a new cluster using spark 1.3? Here is what

Re: Spark as a service

2015-03-25 Thread Irfan Ahmad
You're welcome. How did it go? *Irfan Ahmad* CTO | Co-Founder | *CloudPhysics* Best of VMworld Finalist Best Cloud Management Award NetworkWorld 10 Startups to Watch EMA Most Notable Vendor On Wed, Mar 25, 2015 at 7:53 AM, Ashish Mukherjee < ashish.mukher...@gmail.c

Write Parquet File with spark-streaming with Spark 1.3

2015-03-25 Thread richiesgr
Hi, I've succeeded in writing a Kafka stream to a Parquet file in Spark 1.2, but I can't make it work with Spark 1.3. In streaming I can't use saveAsParquetFile() because I can't add data to an existing Parquet file. I know that it's possible to stream data directly into Parquet; could you help me by providing

Re: Spark Streaming - Minimizing batch interval

2015-03-25 Thread Sean Owen
I don't think it's feasible to set a batch interval of 0.25ms. Even at tens of ms the overhead of the framework is a large factor. Do you mean 0.25s = 250ms? Related thoughts, and I don't know if they apply to your case: If you mean, can you just read off the source that quickly? yes. Sometimes

Total size of serialized results is bigger than spark.driver.maxResultSize

2015-03-25 Thread Wang, Ningjun (LNG-NPV)
Hi I ran a spark job and got the following error. Can anybody tell me how to work around this problem? For example how can I increase spark.driver.maxResultSize? Thanks. org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 128 tasks (1029.1 MB)

Spark Streaming - Minimizing batch interval

2015-03-25 Thread RodrigoB
I've been given a feature requirement that means processing events with a latency lower than 0.25ms. Meaning I would have to make sure that Spark streaming gets new events from the messaging layer within that period of time. Would anyone have achieved such numbers using a Spark cluster? Or would thi

Re: How do you write Dataframes to elasticsearch

2015-03-25 Thread Nick Pentreath
Spark 1.3 is not supported by elasticsearch-hadoop yet but will be very soon:  https://github.com/elastic/elasticsearch-hadoop/issues/400 However in the meantime you could use df.toRDD.saveToEs - though you may have to manipulate the Row object perhaps to extract fields, not sure if it will

What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Stuart Layton
I have a SparkSQL dataframe with a few billion rows that I need to quickly filter down to a few hundred thousand rows, using an operation like (syntax may not be correct) df = df[ df.filter(lambda x: x.key_col in approved_keys)] I was thinking about serializing the data using parquet and saving

Re: issue while submitting Spark Job as --master yarn-cluster

2015-03-25 Thread Sandy Ryza
Hi Sachin, It appears that the application master is failing. To figure out what's wrong you need to get the logs for the application master. -Sandy On Wed, Mar 25, 2015 at 7:05 AM, Sachin Singh wrote: > OS I am using Linux, > when I will run simply as master yarn, its running fine, > > Regar

Re: JavaKinesisWordCountASLYARN Example not working on EMR

2015-03-25 Thread Arush Kharbanda
Did you build for Kinesis using the profile *-Pkinesis-asl*? On Wed, Mar 25, 2015 at 7:18 PM, ankur.jain wrote: > Hi, > I am trying to run a Spark on YARN program provided by Spark in the > examples > directory using Amazon Kinesis on EMR cluster : > I am using Spark 1.3.0 and EMR AMI: 3.5.0 > > I've
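For reference, one way to build Spark with that profile enabled (the flags other than the profile are illustrative):

    mvn -Pkinesis-asl -DskipTests clean package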

foreachRDD execution

2015-03-25 Thread Luis Ángel Vicente Sánchez
I have a simple and probably dumb question about foreachRDD. We are using spark streaming + cassandra to compute concurrent users every 5min. Our batch size is 10secs and our block interval is 2.5secs. At the end of the world we are using foreachRDD to join the data in the RDD with existing data

Re: NetwrokWordCount + Spark standalone

2015-03-25 Thread Akhil Das
You can open the Master UI running on 8080 port of your ubuntu machine and after submitting the job, you can see how many cores are being used etc from the UI. Thanks Best Regards On Wed, Mar 25, 2015 at 6:50 PM, James King wrote: > Thanks Akhil, > > Yes indeed this is why it works when using l

JavaKinesisWordCountASLYARN Example not working on EMR

2015-03-25 Thread ankur.jain
Hi, I am trying to run a Spark on YARN program provided by Spark in the examples directory using Amazon Kinesis on EMR cluster : I am using Spark 1.3.0 and EMR AMI: 3.5.0 I've setup the Credentials export AWS_ACCESS_KEY_ID=XX export AWS_SECRET_KEY=XXX *A) This is the Kinesis Word Count

Re: Spark ML Pipeline inaccessible types

2015-03-25 Thread Peter Rudenko
Hi Martin, here are 2 possibilities to overcome this: 1) Put your logic into the org.apache.spark package in your project - then everything would be accessible. 2) Dirty trick: object SparkVector extends HashingTF { val VectorUDT: DataType = outputDataType } Then you can do like this: Struct

Re: NetwrokWordCount + Spark standalone

2015-03-25 Thread James King
Thanks Akhil, Yes, indeed this is why it works when using local[2], but I'm unclear why it doesn't work when using standalone daemons. Is there a way to check what cores are being seen when running against standalone daemons? I'm running the master and worker on the same Ubuntu host. The Driver progr

Re: FAILED SelectChannelConnector@0.0.0.0:4040 java.net.BindException: Address already in use

2015-03-25 Thread , Roy
Yes, I do have other applications already running. Thanks for your explanation. On Wed, Mar 25, 2015 at 2:49 AM, Akhil Das wrote: > It means you are already having 4 applications running on 4040, 4041, > 4042, 4043. And that's why it was able to run on 4044. > > You can do a *netstat -pnat | gr

Re: spark worker on mesos slave | possible networking config issue

2015-03-25 Thread Akhil Das
Remove SPARK_LOCAL_IP then? Thanks Best Regards On Wed, Mar 25, 2015 at 6:45 PM, Anirudha Jadhav wrote: > is there a way to have this dynamically pick the local IP. > > static assignment does not work cos the workers are dynamically allocated > on mesos > > On Wed, Mar 25, 2015 at 3:04 AM, Akh
