Spark + Salesforce SSL Handshake Issue

2017-12-17 Thread kali.tumm...@gmail.com
Hi All, I was trying out the Spark + Salesforce library on Cloudera 5.9 and I am having an SSL handshake issue; please check out my question on Stack Overflow, no one has answered. The library works fine on Windows but fails when I try to run it on the Cloudera edge node.

sql to spark scala rdd

2016-07-29 Thread kali.tumm...@gmail.com
Hi All, I managed to write the business requirement in spark-sql and Hive. I am still learning Scala; how can the SQL below be written using a Spark RDD rather than Spark data frames? SELECT DATE, balance, SUM(balance) OVER (ORDER BY DATE ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) daily_balance FROM table
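For reference, a minimal RDD sketch of that running total, not taken from the thread; the (date, balance) layout and sample rows are assumptions: sort by date, compute per-partition offsets, then prefix-sum within each partition.

import org.apache.spark.{SparkConf, SparkContext}

object RunningBalance {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RunningBalance").setMaster("local[*]"))
    // Assumed input shape: (DATE, balance) pairs; sample data is illustrative.
    val rows = sc.parallelize(Seq(("2016-07-01", 100.0), ("2016-07-02", 50.0), ("2016-07-03", -20.0)))

    // 1. Sort by date, mirroring the ORDER BY DATE in the SQL.
    val sorted = rows.sortByKey()

    // 2. Per-partition totals, then the running offset each partition must add.
    val partitionTotals = sorted
      .mapPartitionsWithIndex((i, it) => Iterator((i, it.map(_._2).sum)))
      .collect()
      .sortBy(_._1)
      .map(_._2)
    val offsets = partitionTotals.scanLeft(0.0)(_ + _) // offsets(i) = sum of all earlier partitions

    // 3. Prefix-sum within each partition, shifted by that partition's offset.
    val withRunningBalance = sorted.mapPartitionsWithIndex { (i, it) =>
      var acc = offsets(i)
      it.map { case (date, balance) =>
        acc += balance
        (date, balance, acc) // (DATE, balance, daily_balance)
      }
    }

    withRunningBalance.collect().foreach(println)
    sc.stop()
  }
}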

spark local dir to HDFS ?

2016-07-05 Thread kali.tumm...@gmail.com
Hi All, can I set spark.local.dir to an HDFS location instead of the /tmp folder? I tried setting the temp folder to HDFS but it didn't work; can spark.local.dir write to HDFS? .set("spark.local.dir","hdfs://namednode/spark_tmp/") 16/07/05 15:35:47 ERROR DiskBlockManager: Failed to create local

Re: spark parquet too many small files ?

2016-07-01 Thread kali.tumm...@gmail.com
I found the JIRA for the issue: https://issues.apache.org/jira/browse/SPARK-6221. Will there be a fix in the future, or no fix?

Re: spark parquet too many small files ?

2016-07-01 Thread kali.tumm...@gmail.com
Hi Neelesh, as I mentioned in my earlier emails, it's not a Spark/Scala application; I am working with just Spark SQL. I am launching the spark-sql shell and running my Hive code inside the Spark SQL shell. The Spark SQL shell accepts functions which relate to Spark SQL; it doesn't accept functions like coalesce, which is

spark parquet too many small files ?

2016-07-01 Thread kali.tumm...@gmail.com
Hi All, I am running Hive in spark-sql in YARN client mode; the SQL is pretty simple, loading dynamic partitions into a target Parquet table. I used Hive configuration parameters such as (set hive.merge.smallfiles.avgsize=25600; set hive.merge.size.per.task=256000;) which usually merge small

Re: Best way to merge final output part files created by Spark job

2016-07-01 Thread kali.tumm...@gmail.com
Try using the coalesce function to repartition to the desired number of partition files; to merge already-written output files, use Hive and insert overwrite the table using the options below. set hive.merge.smallfiles.avgsize=256; set hive.merge.size.per.task=256; set
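For reference, a minimal sketch of the coalesce suggestion for newly written output; the source table, output path, and target file count are assumptions.

import org.apache.spark.sql.SQLContext

def writeFewFiles(sqlContext: SQLContext, outputPath: String): Unit = {
  val df = sqlContext.sql("SELECT * FROM source_table") // hypothetical source table
  // coalesce(8) caps the number of output part files at 8 without a full shuffle
  // (unlike repartition, which always shuffles).
  df.coalesce(8).write.mode("overwrite").parquet(outputPath)
}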

Re: output part files max size

2016-07-01 Thread kali.tumm...@gmail.com
I am not sure, but you can use the coalesce function to reduce the number of output files. Thanks Sri

Spark partition formula on standalone mode?

2016-06-27 Thread kali.tumm...@gmail.com
Hi All, I have worked with Spark installed on a Hadoop cluster but never with Spark on a standalone cluster. My question is how to set the number of partitions in Spark when it's running on a standalone cluster. With Spark on Hadoop I calculate my formula using HDFS block sizes, but how do I calculate
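For reference, a rough sketch of one common sizing heuristic, not taken from the thread: on standalone mode, base the partition count on total executor cores rather than HDFS block size. All numbers below are illustrative assumptions.

import org.apache.spark.SparkConf

val numWorkers = 4          // assumed number of worker machines
val coresPerWorker = 8      // assumed cores given to Spark per worker
val partitionsPerCore = 3   // common rule of thumb: 2-4 tasks per core

val defaultParallelism = numWorkers * coresPerWorker * partitionsPerCore

val conf = new SparkConf()
  .setAppName("PartitionSizing")
  .set("spark.default.parallelism", defaultParallelism.toString)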

spark sql broadcast join ?

2016-06-16 Thread kali.tumm...@gmail.com
Hi All, I have used broadcast joins in Spark/Scala applications; I used partitionBy (HashPartitioner) and then persist for wide dependencies. The present project I am working on is pretty much a Hive migration to spark-sql, which is pretty much pure SQL to be honest, no Scala or Python apps. My question
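For reference, a hedged sketch of the two usual ways to get a broadcast join when the workload is mostly SQL; the table names and the 100 MB threshold are assumptions, not details from the thread.

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.broadcast

def broadcastJoinExamples(sqlContext: SQLContext): Unit = {
  // 1. Pure SQL: raise the auto-broadcast threshold so the small dimension
  //    table (assumed to be under 100 MB here) is broadcast automatically.
  sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=104857600")
  sqlContext.sql(
    """SELECT f.*, d.dim_name
      |FROM fact_table f
      |JOIN dim_table d ON f.dim_id = d.dim_id""".stripMargin)

  // 2. DataFrame API (Spark 1.5+): force a broadcast with the broadcast() hint.
  val fact = sqlContext.table("fact_table")
  val dim  = sqlContext.table("dim_table")
  fact.join(broadcast(dim), "dim_id").show()
}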

Re: Spark 1.6 Driver Memory Issue

2016-06-01 Thread kali.tumm...@gmail.com
Hi, I am using the spark-sql shell; while launching I run it as spark-sql --conf spark.driver.maxResultSize=20g. I tried using spark-sql --conf "spark.driver.maxResults"="20g" but still no luck. Do I need to use a set command, something like spark-sql --conf set "spark.driver.maxReults"="20g"

Spark 1.6 Driver Memory Issue

2016-06-01 Thread kali.tumm...@gmail.com
Hi All, I am getting a Spark driver memory issue even after overriding the conf using --conf spark.driver.maxResultSize=20g, and I also set it in my SQL script (set spark.driver.maxResultSize=16;) but the same error still happens. Job aborted due to stage failure: Total size of
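For reference, a hedged sketch, an assumption on my part rather than a confirmed fix for this thread: the property is spark.driver.maxResultSize, it normally has to be in place before the SparkContext starts, and the value needs an explicit size unit (a bare 16 is not 16 GB).

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Set before the context is created; changing it later typically has no effect.
val conf = new SparkConf()
  .setAppName("MaxResultSizeExample")
  .set("spark.driver.maxResultSize", "20g") // note the unit suffix

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val result = sqlContext.sql("SELECT COUNT(*) FROM some_table") // hypothetical query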

set spark 1.6 with Hive 0.14 ?

2016-05-20 Thread kali.tumm...@gmail.com
Hi All, is there a way to ask Spark and spark-sql to use Hive 0.14 instead of the built-in Hive 1.2.1? I am testing spark-sql locally by downloading Spark 1.6 from the internet; I want to execute my Hive queries in Spark SQL using Hive version 0.14. Can I go back to the previous version just for a
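One configuration approach worth noting, an assumption on my part rather than a confirmed answer from this thread: Spark 1.6 can talk to an older Hive metastore via spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars, while still using its built-in Hive 1.2.1 classes for query execution.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf()
  .setAppName("OldMetastore")
  .set("spark.sql.hive.metastore.version", "0.14.0")
  // "maven" downloads matching Hive jars; alternatively point this at a local classpath
  .set("spark.sql.hive.metastore.jars", "maven")

val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
sqlContext.sql("SHOW TABLES").show()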

how to run latest version of spark in old version of spark in cloudera cluster ?

2016-01-27 Thread kali.tumm...@gmail.com
Hi All, I just realized the Cloudera version of Spark on my cluster is 1.2; the jar which I built using Maven is version 1.6, which is causing the issue. Is there a way to run a Spark 1.6 application on a Spark 1.2 cluster? Thanks Sri

org.netezza.error.NzSQLException: ERROR: Invalid datatype - TEXT

2016-01-26 Thread kali.tumm...@gmail.com
Hi All, I am using the Spark JDBC DataFrame writer to store data into Netezza. I think Spark is trying to create the table using data type TEXT for string columns, and Netezza doesn't support the TEXT data type. How do I override the Spark method to use VARCHAR instead of TEXT? val

Re: org.netezza.error.NzSQLException: ERROR: Invalid datatype - TEXT

2016-01-26 Thread kali.tumm...@gmail.com
Fixed by creating a new Netezza dialect and registering it in JdbcDialects using the JdbcDialects.registerDialect(NetezzaDialect) method (spark/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala). package com.citi.ocean.spark.elt /** * Created by st84879 on 26/01/2016. */ import
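For reference, a minimal sketch of the kind of dialect described above; the VARCHAR length is an assumption, since the original code is truncated.

import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

object NetezzaDialect extends JdbcDialect {
  // Apply this dialect only to Netezza JDBC URLs.
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:netezza")

  // Map Catalyst StringType to VARCHAR instead of the default TEXT.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR(255)", Types.VARCHAR))
    case _          => None // fall back to Spark's defaults for other types
  }
}

// Register once before calling df.write.jdbc(...)
JdbcDialects.registerDialect(NetezzaDialect)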

Re: spark 1.6 Issue

2016-01-08 Thread kali.tumm...@gmail.com
Hi All, it worked OK after adding the following VM options: -Xms128m -Xmx512m -XX:MaxPermSize=300m -ea. Thanks Sri

spark 1.6 Issue

2016-01-06 Thread kali.tumm...@gmail.com
Hi All, I am running my app in IntelliJ IDEA (locally) with master local[*]. The code worked fine with Spark 1.5, but when I upgraded to 1.6 I get the issue below. Is this a bug in 1.6? When I change back to 1.5 it works without any error. Do I need to pass executor memory while running in local

Apache spark certification pass percentage ?

2015-12-22 Thread kali.tumm...@gmail.com
Hi All, does anyone know the pass percentage for the Apache Spark certification exam? Thanks Sri

how to turn off spark streaming gracefully ?

2015-12-18 Thread kali.tumm...@gmail.com
Hi All, imagine I have production Spark Streaming Kafka (direct connection) subscriber and publisher jobs running which publish to and subscribe to (receive from) a Kafka topic, and I save one day's worth of data using dstream.slice to a Cassandra daily table (so I create the daily table before
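For reference, a minimal sketch of one common graceful-shutdown pattern, not the thread's accepted answer; the socket source below stands in for the Kafka direct stream.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GracefulStopExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("GracefulStop")
      .setMaster("local[2]")
      // Ask Spark to drain in-flight batches when the JVM receives a shutdown signal.
      .set("spark.streaming.stopGracefullyOnShutdown", "true")

    val ssc = new StreamingContext(conf, Seconds(10))

    // Stand-in source; in the real job this would be the Kafka direct stream.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    ssc.start()

    // Programmatic alternative, e.g. from a control thread watching a marker file:
    // ssc.stop(stopSparkContext = true, stopGracefully = true)

    ssc.awaitTermination()
  }
}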

Re: spark data frame write.mode("append") bug

2015-12-12 Thread kali.tumm...@gmail.com
Hi All, https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L48 In the present Spark version there is a bug at line 48: checking whether a table exists in a database using LIMIT doesn't work for all databases. sql

spark data frame write.mode("append") bug

2015-12-09 Thread kali.tumm...@gmail.com
Hi Spark Contributors, I am trying to append data to a target table using the df.write.mode("append") functionality, but Spark throws a "table already exists" exception. Is there a fix scheduled in a later Spark release? I am using Spark 1.5. val sourcedfmode=sourcedf.write.mode("append")

Re: can i write only RDD transformation into hdfs or any other storage system

2015-12-09 Thread kali.tumm...@gmail.com
Hi Prateek, do you mean writing Spark output to any storage system? Yes, you can. Thanks Sri

Release date for spark 1.6?

2015-12-09 Thread kali.tumm...@gmail.com
Hi All, does anyone know the exact release date for Spark 1.6? Thanks Sri

Re: Exception in Spark-sql insertIntoJDBC command

2015-12-08 Thread kali.tumm...@gmail.com
Hi All, I have the same error in Spark 1.5; is there any solution to work around this? I also tried using sourcedf.write.mode("append") but still no luck. val sourcedfmode=sourcedf.write.mode("append") sourcedfmode.jdbc(TargetDBinfo.url,TargetDBinfo.table,targetprops) Thanks Sri

spark sql current time stamp function ?

2015-12-07 Thread kali.tumm...@gmail.com
Hi All, is there a Spark SQL function which returns the current timestamp? Examples: Impala: select NOW(); SQL Server: select GETDATE(); Netezza: select NOW(); Thanks Sri

Re: spark sql current time stamp function ?

2015-12-07 Thread kali.tumm...@gmail.com
I found a way out. import java.text.SimpleDateFormat import java.util.Date; val format = new SimpleDateFormat("yyyy-M-dd hh:mm:ss") val testsql=sqlContext.sql("select column1,column2,column3,column4,column5 ,'%s' as TIME_STAMP from TestTable limit 10".format(format.format(new Date(
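If the cluster is on Spark 1.5 or later, the built-in current_timestamp() function can replace the driver-side formatting above; a minimal sketch, with the column and table names carried over from the post as placeholders.

import org.apache.spark.sql.{DataFrame, SQLContext}

def selectWithTimestamp(sqlContext: SQLContext): DataFrame =
  sqlContext.sql(
    "SELECT column1, column2, current_timestamp() AS TIME_STAMP FROM TestTable LIMIT 10")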

Spark sql random number or sequence numbers ?

2015-12-07 Thread kali.tumm...@gmail.com
Hi All, I implemented random numbers using Scala/Spark; is there a function to get a row_number equivalent in Spark SQL? Examples: SQL Server: row_number(); Netezza: sequence number; MySQL: sequence number. Example: val testsql=sqlContext.sql("select
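Two possible approaches, neither taken from the thread itself: a Hive-style window function for a true ROW_NUMBER() (which needs a HiveContext in Spark 1.x), or monotonicallyIncreasingId for unique but non-consecutive ids. Table and column names are placeholders.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions.monotonicallyIncreasingId

def addRowNumbers(hiveContext: HiveContext): DataFrame =
  hiveContext.sql(
    "SELECT column1, ROW_NUMBER() OVER (ORDER BY column1) AS row_num FROM TestTable")

def addUniqueIds(df: DataFrame): DataFrame =
  df.withColumn("row_id", monotonicallyIncreasingId())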

Spark sql data frames do they run in parallel by default?

2015-12-06 Thread kali.tumm...@gmail.com
Hi all, I wrote the Spark code below to extract data from SQL Server using SQLContext.read.format with several different options. My question: does the sqlContext.read load function run in parallel by default, and does it use all the available cores? When I am saving the output to a file it is

Re: Spark sql data frames do they run in parallel by default?

2015-12-06 Thread kali.tumm...@gmail.com
Hi All, I rewrote my code to use sqlContext.read.jdbc, which lets me specify upperBound, lowerBound, numPartitions etc. and which might run in parallel; I need to try it on a cluster, which I will do when I have time. But please confirm: does read.jdbc do parallel reads? Spark code: package
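For reference, a minimal sketch of the partitioned read; the connection details, table, and bounds are placeholders. Spark issues numPartitions concurrent JDBC queries, each covering a slice of the partition column between lowerBound and upperBound.

import java.util.Properties
import org.apache.spark.sql.{DataFrame, SQLContext}

def parallelJdbcRead(sqlContext: SQLContext): DataFrame = {
  val props = new Properties()
  props.setProperty("user", "spark_user")   // placeholder
  props.setProperty("password", "secret")   // placeholder

  sqlContext.read.jdbc(
    url = "jdbc:sqlserver://dbhost:1433;databaseName=sales", // placeholder URL
    table = "dbo.transactions",                              // placeholder table
    columnName = "transaction_id",  // numeric column used to split the range
    lowerBound = 1L,
    upperBound = 1000000L,
    numPartitions = 8,              // 8 parallel JDBC connections/tasks
    connectionProperties = props)
}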

Does spark streaming write ahead log writes all received data to HDFS ?

2015-11-20 Thread kali.tumm...@gmail.com
Hi All, if write-ahead logs are enabled in Spark Streaming, does all the received data get written to the HDFS path, or does it only write the metadata? How does clean-up work? Does the HDFS path get bigger and bigger every day, and do I need to write a clean-up job to delete data from the write-ahead logs

spark inner join

2015-10-24 Thread kali.tumm...@gmail.com
Hi All, in SQL, say for example I have table1 (movieid) and table2 (movieid, moviename); in SQL we write something like select moviename, movieid, count(1) from table2 inner join table1 on table1.movieid = table2.movieid group by ...; here in SQL table1 has only one column whereas table2 has
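For reference, a minimal RDD sketch of the SQL above; the sample data and layout are assumptions, not the poster's actual files.

import org.apache.spark.{SparkConf, SparkContext}

object MovieInnerJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MovieInnerJoin").setMaster("local[*]"))

    // table1: just movie ids (e.g. one row per viewing event)
    val table1 = sc.parallelize(Seq(1, 2, 2, 3)).map(id => (id, 1))
    // table2: movie id -> movie name
    val table2 = sc.parallelize(Seq((1, "Heat"), (2, "Up"), (3, "Ronin")))

    // inner join on movieid, then count occurrences per (movieid, moviename)
    val counts = table2.join(table1)                       // (movieid, (moviename, 1))
      .map { case (id, (name, one)) => ((id, name), one) }
      .reduceByKey(_ + _)

    counts.collect().foreach { case ((id, name), n) => println(s"$name\t$id\t$n") }
    sc.stop()
  }
}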

Spark error:- Not a valid DFS File name

2015-10-23 Thread kali.tumm...@gmail.com
Hi All, I got this weird error when I tried to run Spark in yarn-cluster mode. I have 33 files and I am looping Spark in bash one by one; most of them worked fine except a few files. Is the error below an HDFS or a Spark error? Exception in thread "Driver" java.lang.IllegalArgumentException: Pathname

Re: Spark error:- Not a valid DFS File name

2015-10-23 Thread kali.tumm...@gmail.com
Full Error:- at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:195) at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:104) at

difference between rdd.collect().toMap to rdd.collectAsMap() ?

2015-10-20 Thread kali.tumm...@gmail.com
Hi All, is there any performance impact when I use collectAsMap on my RDD instead of rdd.collect().toMap? I have a key-value RDD and I want to convert it to a HashMap. As far as I know, collect() is not efficient on large data sets as it pulls everything to the driver; can I use collectAsMap instead? Is there any
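For reference, a small sketch showing the two forms side by side; both materialize the entire RDD on the driver, and the most visible difference is the return type.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("CollectAsMap").setMaster("local[*]"))
val kv = sc.parallelize(Seq(("a", 1), ("b", 2)))

// collect().toMap builds an immutable Map; collectAsMap() returns scala.collection.Map.
val viaToMap: Map[String, Int] = kv.collect().toMap
val viaCollectAsMap: scala.collection.Map[String, Int] = kv.collectAsMap()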

Pass spark partition explicitly ?

2015-10-18 Thread kali.tumm...@gmail.com
Hi All, can I pass the number of partitions for all RDDs explicitly while submitting the Spark job, or do I need to set it in my Spark code itself? Thanks Sri

can I use Spark as alternative for gem fire cache ?

2015-10-17 Thread kali.tumm...@gmail.com
Hi All, can Spark be used as an alternative to GemFire cache? We use GemFire cache to save (cache) dimension data in memory, which is later used by our custom-made Java ETL tool. Can I do something like the below? Can I cache an RDD in memory for a whole day? As far as I know the RDD will be emptied once

Output println info in LogMessage Info ?

2015-10-17 Thread kali.tumm...@gmail.com
Hi All, in Unix I can print a warning or info using LogMessage WARN "Hi All" or LogMessage INFO "Hello World"; is there a similar thing in Spark? Imagine I want to print the count of an RDD in the logs instead of using println. Thanks Sri
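For reference, a minimal sketch using the log4j API that Spark 1.x ships with; the logger name and messages are illustrative, and the count still needs an action to be computed.

import org.apache.log4j.Logger
import org.apache.spark.{SparkConf, SparkContext}

object LogCountExample {
  @transient lazy val log = Logger.getLogger("com.example.etl") // hypothetical logger name

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LogCount").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 1000)

    log.info(s"record count = ${rdd.count()}")   // goes to the driver log
    log.warn("this is a WARN-level message")

    sc.stop()
  }
}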

Is there any better way of writing this code

2015-10-12 Thread kali.tumm...@gmail.com
Hi All, just wondering, is there any better way of writing the code below? I am new to Spark and I feel what I wrote is pretty simple, basic and straightforward; is there a better way of writing it using the functional paradigm? val QuoteRDD=quotefile.map(x => x.split("\\|")). filter(line

Re: Create hashmap using two RDD's

2015-10-10 Thread kali.tumm...@gmail.com
Got it: created the hashmap and saved it to a file; please follow the steps below. val QuoteRDD=quotefile.map(x => x.split("\\|")). filter(line => line(0).contains("1017")). map(x => ((x(5)+x(4)) , (x(5),x(4),x(1) , if (x(15) =="B") ( {if (x(25) == "") x(9) else

Re: Create hashmap using two RDD's

2015-10-10 Thread kali.tumm...@gmail.com
Hi All, I changed my approach; now I am able to load the data into a Map and get data out using the get method. val QuoteRDD=quotefile.map(x => x.split("\\|")). filter(line => line(0).contains("1017")). map(x => ((x(5)+x(4)) , (x(5),x(4),x(1) , if (x(15) =="B") if

Create hashmap using two RDD's

2015-10-09 Thread kali.tumm...@gmail.com
Hi all, I am trying to create a hashmap using two RDDs, but I am having a "key not found" issue; do I need to convert the RDDs to lists first? 1) one RDD has the key data, 2) the other RDD has the value data. Key RDD: val quotekey=file.map(x => x.split("\\|")).filter(line => line(0).contains("1017")).map(x => x(5)+x(4))
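For reference, a minimal sketch under the assumption that the key RDD and value RDD come from the same file and stay line-aligned (which is what zip requires); the value column index and sample rows are made up.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("TwoRddMap").setMaster("local[*]"))

val lines  = sc.parallelize(Seq("1017|a|b|c|d|SYM1|x", "1017|a|b|c|d|SYM2|y"))
val fields = lines.map(_.split("\\|")).filter(_(0).contains("1017"))

val keyRdd   = fields.map(x => x(5) + x(4))  // same key expression as the post
val valueRdd = fields.map(x => x(6))         // hypothetical value column

// zip only works when both RDDs have identical partitioning and element counts,
// which holds here because both derive from `fields` without reordering.
val lookup: scala.collection.Map[String, String] = keyRdd.zip(valueRdd).collectAsMap()
println(lookup)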

Re: KafkaProducer using Cassandra as source

2015-09-23 Thread kali.tumm...@gmail.com
Guys sorry I figured it out. val x=p.collect().mkString("\n").replace("[","").replace("]","").replace(",","~") Full Code:- package com.examples /** * Created by kalit_000 on 22/09/2015. */ import kafka.producer.KeyedMessage import kafka.producer.Producer import kafka.producer.ProducerConfig

Re: Kafka createDirectStream ​issue

2015-09-19 Thread kali.tumm...@gmail.com
Hi, I am trying to develop the same code in IntelliJ IDEA and I am having the same issue; is there any workaround? Error in IntelliJ: cannot resolve symbol createDirectStream. import kafka.serializer.StringDecoder import org.apache.spark._ import org.apache.spark.SparkContext._ import
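For reference, a minimal sketch of what the unresolved symbol usually points to: the spark-streaming-kafka artifact missing from the build plus the KafkaUtils import. The library version, broker, and topic below are assumptions.

// build.sbt (assumed version, match it to the Spark version in use):
// libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.6.0"

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectStreamExample").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // placeholder broker
    val topics      = Set("test-topic")                               // placeholder topic

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()
  }
}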

Re: Unable to see my kafka spark streaming output

2015-09-19 Thread kali.tumm...@gmail.com
Hi All, figured it out: I forgot to set the master to local[2]; at least two local threads are required. package com.examples /** * Created by kalit_000 on 19/09/2015. */ import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.sql.SQLContext import org.apache.spark.SparkConf

Unable to see my kafka spark streaming output

2015-09-19 Thread kali.tumm...@gmail.com
Hi All, I am unable to see the output getting printed in the console; can anyone help? package com.examples /** * Created by kalit_000 on 19/09/2015. */ import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.sql.SQLContext import org.apache.spark.SparkConf

word count (group by users) in spark

2015-09-19 Thread kali.tumm...@gmail.com
Hi All, I would like to achieve the output below using Spark. I managed to write it in Hive and call it from Spark, but not in pure Spark (Scala). How do I group word counts on a particular user (column)? For example, imagine users and their tweets: I want to do a word count based on user name.
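For reference, a minimal sketch of a per-user word count; the sample tweets are illustrative: key by (user, word), then count.

import org.apache.spark.{SparkConf, SparkContext}

object WordCountByUser {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCountByUser").setMaster("local[*]"))

    val tweets = sc.parallelize(Seq(
      ("sri",  "spark spark hive"),
      ("kali", "spark sql")))

    val countsPerUserWord = tweets
      .flatMap { case (user, text) => text.split("\\s+").map(word => ((user, word), 1)) }
      .reduceByKey(_ + _)

    // e.g. ((sri,spark),2), ((sri,hive),1), ((kali,spark),1), ((kali,sql),1)
    countsPerUserWord.collect().foreach(println)
    sc.stop()
  }
}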

split function on spark sql created rdd

2015-05-23 Thread kali.tumm...@gmail.com
Hi All, I am trying to do a word count on a number of tweets; my first step is to get the data from a table using Spark SQL and then run the split function on top of it to calculate the word count. Error: value split is not a member of org.apache.spark.sql.SchemaRDD. Spark code that doesn't work to do the word
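The error usually means split was called on the SchemaRDD itself rather than on each row's string column; a minimal sketch under that assumption, with table and column names as placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TweetWordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TweetWordCount").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    val tweets = sqlContext.sql("SELECT tweet_text FROM tweets") // hypothetical table/column

    val wordCounts = tweets
      .map(row => row.getString(0))      // pull the string column out of each Row first
      .flatMap(_.split("\\s+"))          // now split works, on String rather than the SchemaRDD
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    wordCounts.collect().foreach(println)
    sc.stop()
  }
}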