Spark DataFrame sum of multiple columns

2016-04-21 Thread Naveen Kumar Pokala
Hi, do we have any way to perform row-level operations on Spark DataFrames? For example, I have a DataFrame with columns A, B, C, ..., Z. I want to add one more column, New Column, with the sum of all the column values (e.g. A=1, B=2, C=4, D=3, ..., Z=26 gives New Column = 351). Can somebody
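One way to do this is to fold the column list into a single sum expression and add it with withColumn. A minimal Scala sketch for Spark 1.6 (the DataFrame below is illustrative, not the original data):

    import org.apache.spark.sql.functions._
    import sqlContext.implicits._   // sqlContext as provided by spark-shell in 1.6

    // Illustrative data; the real DataFrame has columns A..Z.
    val df = sc.parallelize(Seq((1, 2, 4), (5, 6, 7))).toDF("A", "B", "C")

    // Build A + B + ... as one Column expression, then add it as a new column.
    val total = df.columns.map(col).reduce(_ + _)
    val withSum = df.withColumn("New Column", total)
    withSum.show()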

EOF exception while reading parquet file from Hadoop

2016-04-20 Thread Naveen Kumar Pokala
read.java:745) Thanks, Naveen Kumar Pokala

Standard deviation on multiple columns

2016-04-18 Thread Naveen Kumar Pokala
Hi, I am using Spark 1.6.0. I want to find the standard deviation of columns that will come dynamically. val stdDevOnAll = columnNames.map { x => stddev(x) } causalDf.groupBy(causalDf("A"), causalDf("B"), causalDf("C")).agg(stdDevOnAll:_*) //error line I am trying to do as above. But it
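The error line fails because the grouped agg method takes a first Column plus varargs (agg(expr: Column, exprs: Column*)), so a Seq of Columns cannot be splatted into it directly. A minimal Scala sketch of the usual head/tail workaround (the DataFrame and dynamic column list here are illustrative):

    import org.apache.spark.sql.functions._
    import sqlContext.implicits._   // sqlContext as provided by spark-shell in 1.6

    // Illustrative stand-ins for causalDf and the dynamically arriving columns.
    val causalDf = Seq(("a1", "b1", "c1", 1.0, 2.0), ("a1", "b1", "c1", 3.0, 4.0))
      .toDF("A", "B", "C", "D", "E")
    val columnNames = Seq("D", "E")

    val stdDevOnAll = columnNames.map(c => stddev(col(c)))

    // Pass the head explicitly and splat only the tail into the varargs slot.
    val result = causalDf
      .groupBy(causalDf("A"), causalDf("B"), causalDf("C"))
      .agg(stdDevOnAll.head, stdDevOnAll.tail: _*)
    result.show()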

Failed to locate the winutils binary in the hadoop binary path

2015-01-29 Thread Naveen Kumar Pokala
Hi, I am facing the following issue when I am connecting from spark-shell. Please tell me how to avoid it. 15/01/29 17:21:27 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop
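A commonly reported workaround on Windows is to download a winutils.exe build, place it under a directory such as C:\hadoop\bin, and point hadoop.home.dir (or the HADOOP_HOME environment variable) at that directory before Hadoop classes load. A sketch, with the path purely illustrative:

    // Set this early in the driver, before any Hadoop/Spark classes initialize.
    // C:\hadoop is an assumed location containing bin\winutils.exe.
    System.setProperty("hadoop.home.dir", "C:\\hadoop")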

Pyspark Interactive shell

2015-01-06 Thread Naveen Kumar Pokala
Hi, has anybody tried to connect to a Spark cluster (on UNIX machines) from the Windows interactive shell? -Naveen.

pyspark.daemon not found

2014-12-31 Thread Naveen Kumar Pokala
Error from python worker: python: module pyspark.daemon not found PYTHONPATH was: /home/npokala/data/spark-install/spark-master/python: Can somebody please help me resolve this issue? -Naveen

Re: pyspark.daemon not found

2014-12-31 Thread Naveen Kumar Pokala
From: Naveen Kumar Pokala [mailto:npok...@spcapitaliq.com] Sent: Wednesday, December 31, 2014 2:28 PM To: user@spark.apache.org Subject: pyspark.daemon not found Error from python worker: python: module pyspark.daemon not found PYTHONPATH was: /home/npokala/data/spark-install/spark-master

python: module pyspark.daemon not found

2014-12-29 Thread Naveen Kumar Pokala
14/12/29 18:10:56 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 2, nj09mhf0730.mhf.mhc, PROCESS_LOCAL, 1246 bytes) 14/12/29 18:10:56 INFO TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) on executor nj09mhf0730.mhf.mhc: org.apache.spark.SparkException ( Error from python worker:

Spark Job submit

2014-11-26 Thread Naveen Kumar Pokala
Hi. Is there a way to submit a Spark job to a Hadoop YARN cluster from Java code? -Naveen
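One option, available in later Spark releases (1.4 and up, so newer than this thread), is the SparkLauncher API, which starts spark-submit programmatically. A sketch in Scala (the same calls work from Java); every path and class name below is a placeholder:

    import org.apache.spark.launcher.SparkLauncher

    val process = new SparkLauncher()
      .setSparkHome("/opt/spark")                    // local Spark installation
      .setAppResource("/path/to/my-spark-job.jar")   // jar containing the job
      .setMainClass("com.example.MySparkJob")
      .setMaster("yarn-client")                      // or "yarn-cluster"
      .setConf(SparkLauncher.EXECUTOR_MEMORY, "2g")
      .launch()                                      // returns a java.lang.Process

    process.waitFor()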

RE: Control number of parquet files generated from JavaSchemaRDD

2014-11-25 Thread Naveen Kumar Pokala
Hi, while submitting your Spark job, pass --executor-cores 2 --num-executors 24; it will divide the dataset into 24*2 parquet files. Or set spark.default.parallelism to a value like 50 on the SparkConf object; it will divide the dataset into 50 files in your HDFS. -Naveen -Original
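A sketch of the second suggestion, setting spark.default.parallelism on the SparkConf (the app name and value are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // Per the suggestion above: with spark.default.parallelism = 50, operations that
    // fall back to the default partitioning produce 50 partitions, hence 50 part files.
    val conf = new SparkConf()
      .setAppName("ParquetFileCount")
      .set("spark.default.parallelism", "50")
    val sc = new SparkContext(conf)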

Submit Spark driver on Yarn Cluster in client mode

2014-11-24 Thread Naveen Kumar Pokala
Hi, I want to submit my Spark program from my machine to a YARN cluster in yarn-client mode. How do I specify all the required details through spark-submit? Please provide me some details. -Naveen.

Re: Submit Spark driver on Yarn Cluster in client mode

2014-11-24 Thread Naveen Kumar Pokala
, 2014 4:08 PM To: Naveen Kumar Pokala Cc: user@spark.apache.org Subject: Re: Submit Spark driver on Yarn Cluster in client mode You can export the Hadoop configuration dir (export HADOOP_CONF_DIR=XXX) in the environment and then submit it like: ./bin/spark-submit

RE: Null pointer exception with larger datasets

2014-11-18 Thread Naveen Kumar Pokala
Thanks Akhil. -Naveen. From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Tuesday, November 18, 2014 1:19 PM To: Naveen Kumar Pokala Cc: user@spark.apache.org Subject: Re: Null pointer exception with larger datasets Make sure your list is not null; if that is null then it's more like

HDFS read text file

2014-11-17 Thread Naveen Kumar Pokala
Hi, JavaRDD<Instrument> studentsData = sc.parallelize(list); -- list is student info, List<Student> studentsData.saveAsTextFile("hdfs://master/data/spark/instruments.txt"); The above statements saved the student information in HDFS as a text file. Each object is a line in the text file as below.
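The subject suggests the question is how to read that file back. Since saveAsTextFile writes each object's toString(), reading it back means parsing each line yourself. A sketch in Scala (the Java API is analogous); the Student fields and line format are assumptions:

    // Read the saved text file back from HDFS and parse each line.
    case class Student(id: String, name: String)

    val lines = sc.textFile("hdfs://master/data/spark/instruments.txt")
    val students = lines.map { line =>
      val parts = line.split(",")      // assumes a comma-separated toString format
      Student(parts(0).trim, parts(1).trim)
    }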

Null pointer exception with larger datasets

2014-11-17 Thread Naveen Kumar Pokala
Hi, I have a list of Students whose size is one lakh (100,000) and I am trying to save it to a file. It is throwing a null pointer exception. JavaRDD<Student> distData = sc.parallelize(list); distData.saveAsTextFile("hdfs://master/data/spark/instruments.txt"); 14/11/18 01:33:21 WARN scheduler.TaskSetManager: Lost

Spark GLIBC error

2014-11-13 Thread Naveen Kumar Pokala
Hi, I am receiving the following error when I am trying to run a sample Spark program. Caused by: java.lang.UnsatisfiedLinkError:

RE: scala.MatchError

2014-11-12 Thread Naveen Kumar Pokala
) case class Instrument(issue: Issue = null) -Naveen From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Wednesday, November 12, 2014 12:09 AM To: Xiangrui Meng Cc: Naveen Kumar Pokala; user@spark.apache.org Subject: Re: scala.MatchError Xiangrui is correct that it must be a Java bean
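For the Scala API, the case-class route sidesteps the Java-bean requirement entirely. A sketch against Spark 1.1 in spark-shell (Issue's fields here are assumed, not the original definitions):

    import org.apache.spark.sql.SQLContext

    case class Issue(id: Int, name: String)
    case class Instrument(issue: Issue)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD     // implicit RDD[Product] -> SchemaRDD conversion

    val instruments = sc.parallelize(Seq(Instrument(Issue(1, "bond"))))
    instruments.registerTempTable("instruments")
    sqlContext.sql("SELECT * FROM instruments").collect()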

Spark SQL configurations

2014-11-12 Thread Naveen Kumar Pokala
[inline image listing the Spark SQL properties] Hi, how do I set the above properties on JavaSQLContext? I am not able to see a setConf method on the JavaSQLContext object. I have added the spark core jar and spark assembly jar to my build path, and I am using Spark 1.1.0 and Hadoop 2.4.0. --Naveen

RE: Spark SQL configurations

2014-11-12 Thread Naveen Kumar Pokala
Thanks Akhil. -Naveen From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Wednesday, November 12, 2014 6:38 PM To: Naveen Kumar Pokala Cc: user@spark.apache.org Subject: Re: Spark SQL configurations JavaSQLContext.sqlContext.setConf is available. Thanks Best Regards On Wed, Nov 12, 2014
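A sketch of the suggestion in the reply, shown in Scala for Spark 1.1: JavaSQLContext itself exposes no setConf, but the SQLContext it wraps does. The property name is only an example:

    import org.apache.spark.api.java.JavaSparkContext
    import org.apache.spark.sql.api.java.JavaSQLContext

    val javaSc  = new JavaSparkContext(sc)
    val javaSql = new JavaSQLContext(javaSc)

    // Reach through to the underlying SQLContext to set SQL properties.
    javaSql.sqlContext.setConf("spark.sql.shuffle.partitions", "10")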

Snappy error with Spark SQL

2014-11-12 Thread Naveen Kumar Pokala
Hi, I am facing the following problem when I am trying to save my RDD as a parquet file. 14/11/12 07:43:59 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 48,): org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null

save as file

2014-11-11 Thread Naveen Kumar Pokala
Hi, I am on Spark 1.1.0. I need help with saving an RDD to a JSON file. How do I do that? And how do I mention the HDFS path in the program? -Naveen
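Spark 1.1 does not ship a JSON writer for a plain RDD, so one common approach is to map each record to a JSON string and call saveAsTextFile with an hdfs:// path. A sketch (the record type and hand-built JSON formatting are illustrative; any JSON library could replace them):

    case class Student(id: Int, name: String)

    val students = sc.parallelize(Seq(Student(1, "a"), Student(2, "b")))

    // One JSON document per line, written to the given HDFS path.
    val asJson = students.map(s => s"""{"id": ${s.id}, "name": "${s.name}"}""")
    asJson.saveAsTextFile("hdfs://master/data/spark/students.json")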

scala.MatchError

2014-11-11 Thread Naveen Kumar Pokala
Hi, This is my Instrument java constructor. public Instrument(Issue issue, Issuer issuer, Issuing issuing) { super(); this.issue = issue; this.issuer = issuer;

Parallelize on spark context

2014-11-06 Thread Naveen Kumar Pokala
Hi, JavaRDD<Integer> distData = sc.parallelize(data); On what basis does parallelize split the data into multiple datasets? How do I control how many datasets are executed per executor? For example, my data is a list of 1000 integers and I have a 2-node YARN cluster. It is dividing into
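When no slice count is given, parallelize uses spark.default.parallelism (roughly the total cores available to the application); a second argument sets the number of partitions explicitly. A Scala sketch (the Java API has the same overload):

    // Force a specific number of partitions instead of the default parallelism.
    val data = (1 to 1000).toList
    val distData = sc.parallelize(data, 8)     // 8 partitions of ~125 integers each
    println(distData.partitions.length)        // => 8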

RE: Parallelize on spark context

2014-11-06 Thread Naveen Kumar Pokala
. How do I check how many cores are running to complete the tasks of the 8 datasets? (Is there any command or UI to check that?) Regards, Naveen. From: holden.ka...@gmail.com [mailto:holden.ka...@gmail.com] On Behalf Of Holden Karau Sent: Friday, November 07, 2014 12:46 PM To: Naveen Kumar Pokala Cc: user

Number cores split up

2014-11-05 Thread Naveen Kumar Pokala
Hi, I have a 2-node YARN cluster and I am using Spark 1.1.0 to submit my tasks. As per the Spark documentation, the number of cores is the maximum cores available. So does that mean each node creates a number of threads equal to its number of cores to process the job assigned to that node? For example, List<Integer>

Spark Debugging

2014-10-30 Thread Naveen Kumar Pokala
Hi, I have installed a 2-node Hadoop cluster (for example, on Unix machines A and B: A is the master node and a data node, B is a data node). I am submitting my driver programs through Spark 1.1.0 with bin/spark-submit from a PuTTY client on my Windows machine. I want to debug my program from Eclipse