Setting Optimal Number of Spark Executor Instances

2017-03-14 Thread kpeng1
Hi All, I am currently on Spark 1.6 and I was doing a sql join on two tables that are over 100 million rows each and I noticed that it was spawning 3+ tasks (this is the progress meter that we are seeing show up). We tried to coalesce, repartition and shuffle partitions to drop the number of
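A minimal sketch of the knobs usually involved here; the DataFrame names and join key are hypothetical, the values are illustrative, and executor count on YARN is set separately via --num-executors / spark.executor.instances at submit time:

    // Spark 1.6: spark.sql.shuffle.partitions controls how many tasks a SQL join/aggregation produces
    sqlContext.setConf("spark.sql.shuffle.partitions", "400")  // default is 200; value shown is illustrative

    // df1 and df2 are hypothetical DataFrames; "id" is a hypothetical join key
    val joined = df1.join(df2, df1("id") === df2("id"))
    val compacted = joined.coalesce(200)  // reduce the partition count of the result after the join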

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread kpeng1
Also, running the query as an inner join produced the same results: sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN dps_pin_promo_lt d ON (s.date

Weird results with Spark SQL Outer joins

2016-05-02 Thread kpeng1
Hi All, I am running into a weird result with Spark SQL Outer joins. The results for all of them seem to be the same, which does not make sense given the data. Here are the queries that I am running with the results: sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS
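One common way an outer join silently behaves like an inner join is a WHERE filter on the nullable side, which discards the NULL-padded rows. A hedged sketch of the difference; the table and column names are taken from the preview above, but the date filter is a hypothetical illustration:

    // A filter on the right-hand (nullable) side placed in WHERE drops the NULL-padded rows,
    // so a LEFT OUTER JOIN returns the same rows as an INNER JOIN:
    sqlContext.sql("""SELECT s.account, d.account FROM swig_pin_promo_lt s
      LEFT OUTER JOIN dps_pin_promo_lt d
        ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad)
      WHERE d.date = '2016-01-01'""")

    // Moving the predicate into the ON clause preserves outer-join semantics:
    sqlContext.sql("""SELECT s.account, d.account FROM swig_pin_promo_lt s
      LEFT OUTER JOIN dps_pin_promo_lt d
        ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad AND d.date = '2016-01-01')""")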

println not appearing in libraries when running job using spark-submit --master local

2016-03-28 Thread kpeng1
Hi All, I am currently trying to debug a spark application written in scala. I have a main method: def main(args: Array[String]) { ... SocialUtil.triggerAndWait(triggerUrl) ... The SocialUtil object is included in a separate jar. I launched the spark-submit command using --jars

Using regular rdd transforms on schemaRDD

2015-03-17 Thread kpeng1
Hi All, I was wondering how rdd transformations work on schemaRDDs. Is there a way to force the rdd transform to keep the schemaRDD types, or do I need to recreate the schemaRDD by applying the applySchema method? Currently what I have is an array of SchemaRDDs and I just want to do a union
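On the Spark 1.2/1.3-era API, plain RDD transformations on a SchemaRDD generally come back as RDD[Row] and lose the schema, while SchemaRDD.unionAll stays within SchemaRDD, which matches the follow-up below. A minimal sketch, assuming an array of SchemaRDDs that all share the same schema:

    // rdds: Array[SchemaRDD], all with identical schemas.
    // rdds.reduce(_ union _) would fall back to RDD[Row] and drop the schema;
    // unionAll keeps the SchemaRDD type, so no applySchema call is needed afterwards.
    val combined = rdds.reduce(_ unionAll _)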

Re: Using regular rdd transforms on schemaRDD

2015-03-17 Thread kpeng1
Looks like if I use unionAll this works.

Creating a hive table on top of a parquet file written out by spark

2015-03-16 Thread kpeng1
Hi All, I wrote out a complex parquet file from spark sql and now I am trying to put a hive table on top. I am running into issues with creating the hive table itself. Here is the json that I wrote out to parquet using spark sql:
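A hedged sketch of layering an external Hive table over Parquet files written by Spark SQL; the path, table name, and columns (including the nested struct) are hypothetical stand-ins for the JSON described in the post, and STORED AS PARQUET assumes Hive 0.13+, while older Hive needs the explicit Parquet SerDe/input/output format clauses:

    hiveContext.sql("""
      CREATE EXTERNAL TABLE events_parquet (
        id      STRING,
        metrics STRUCT<impressions: BIGINT, clicks: BIGINT>,
        tags    ARRAY<STRING>
      )
      STORED AS PARQUET
      LOCATION '/user/cloudera/output/events'
    """)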

Loading in json with spark sql

2015-03-13 Thread kpeng1
Hi All, I was noodling around with loading a json file into spark sql's hive context and I noticed that I get the following message after loading in the json file: PhysicalRDD [_corrupt_record#0], MappedRDD[5] at map at JsonRDD.scala:47 I am using the HiveContext to load in the json file
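For reference, the 1.x jsonFile/jsonRDD readers expect one complete JSON object per line; a pretty-printed, multi-line file typically comes back as the single _corrupt_record column seen here. A minimal sketch (path hypothetical):

    val events = hiveContext.jsonFile("/user/cloudera/input/events.json")
    events.printSchema()   // a lone "_corrupt_record: string" usually means the records span multiple lines
    events.registerTempTable("events")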

spark sql writing in avro

2015-03-12 Thread kpeng1
Hi All, I am currently trying to write out a schema RDD to Avro. I noticed that there is a databricks spark-avro library and I have included that in my dependencies, but it looks like I am not able to access the AvroSaver object. On compilation of the job I get this: error: not found: value
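A hedged sketch of what early databricks spark-avro usage looked like; the AvroSaver.save entry point is an assumption based on that library's README of the era, and a "not found: value" error usually points at a missing import or at the spark-avro artifact not actually being on the compile classpath (check its Scala/Spark version suffix):

    import com.databricks.spark.avro.AvroSaver

    // results is a SchemaRDD produced by Spark SQL; the output path is hypothetical
    AvroSaver.save(results, "/user/cloudera/output/results_avro")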

Writing wide parquet file in Spark SQL

2015-03-10 Thread kpeng1
Hi All, I am currently trying to write a very wide file into parquet using spark sql. I have records with 100K columns that I am trying to write out, but of course I am running into space issues (out of memory: heap space). I was wondering if there are any tweaks or workarounds for this. I am
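No guaranteed fix for this, but the knobs usually tried for very wide Parquet writes are shrinking the row-group size (the writer buffers a full row group per open file in memory) and giving the writers more heap. A hedged sketch with illustrative values:

    // Smaller row groups and pages mean less data buffered in memory per open Parquet file
    sc.hadoopConfiguration.setInt("parquet.block.size", 16 * 1024 * 1024)  // parquet-mr default is 128 MB
    sc.hadoopConfiguration.setInt("parquet.page.size", 64 * 1024)          // parquet-mr default is 1 MB

    // Heap itself is set at submit time, e.g. --conf spark.executor.memory=8g --conf spark.driver.memory=8g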

Passing around SparkContext with in the Driver

2015-03-04 Thread kpeng1
Hi All, I am trying to create a class that wraps functionalities that I need; some of these functions require access to the SparkContext, which I would like to pass in. I know that the SparkContext is not serializable, and I am not planning on passing it to worker nodes or anything, I just want
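Holding a SparkContext inside a driver-side helper class is fine as long as no instance of that class is captured by a closure that ships to executors. A minimal sketch; the class and method names are hypothetical and `sc` is assumed to be an existing SparkContext:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    class JobHelpers(sc: SparkContext) {
      // Driver-only: builds RDDs, never serialized to workers
      def loadEvents(path: String): RDD[String] = sc.textFile(path)
    }

    val helpers = new JobHelpers(sc)
    val events = helpers.loadEvents("/user/cloudera/input/events")
    // Inside map/filter closures, reference only the data you need, not `helpers` itself,
    // otherwise Spark will try (and fail) to serialize the wrapped SparkContext.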

Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread kpeng1
Hi All, I am currently having problems with the maven dependencies for version 1.2.0 of spark-core and spark-hive. Here are my dependencies: <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.10</artifactId> <version>1.2.0</version> </dependency> <dependency>

Issues reading in Json file with spark sql

2015-03-02 Thread kpeng1
Hi All, I am currently having issues reading in a json file using spark sql's api. Here is what the json file looks like: { namespace: spacey, name: namer, type: record, fields: [ {name:f1,type:[null,string]}, {name:f2,type:[null,string]}, {name:f3,type:[null,string]},

Spark SQL Converting RDD to SchemaRDD without hardcoding a case class in scala

2015-02-27 Thread kpeng1
Hi All, I am currently trying to build out a spark job that would basically convert a csv file into parquet. From what I have seen, spark sql looks like the way to go; the plan is to load the csv file into an RDD and convert it into a schemaRDD by injecting in
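The usual alternative to a hardcoded case class on the 1.2/1.3 API is building the schema programmatically and calling applySchema. A minimal sketch that treats every CSV column as a string; paths and header handling are simplified, `sc`/`sqlContext` are assumed to exist, and on 1.2 the StructType/StructField/StringType imports live directly under org.apache.spark.sql rather than org.apache.spark.sql.types:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, StringType}

    val lines = sc.textFile("/user/cloudera/input/data.csv")
    val headerLine = lines.first()
    val schema = StructType(headerLine.split(",").map(name => StructField(name, StringType, nullable = true)))

    val rows = lines.filter(_ != headerLine).map(_.split(",")).map(fields => Row(fields: _*))
    val schemaRDD = sqlContext.applySchema(rows, schema)
    schemaRDD.saveAsParquetFile("/user/cloudera/output/data_parquet")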

Spark streaming on Yarn

2014-11-17 Thread kpeng1
Hi, I have been using spark streaming in standalone mode and now I want to migrate to spark running on yarn, but I am not sure how you would go about designating a specific node in the cluster to act as an avro listener since I am using the flume based push approach with spark.
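With the flume push model, the receiver has to run on (and bind to) the host and port that the flume agent's avro sink points at; on YARN that means the executor hosting the receiver must end up on that node. A minimal sketch of the receiving side, with hypothetical host, port, and batch interval:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils

    val sparkConf = new SparkConf().setAppName("flume-receiver")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    // Must match the host/port configured on the flume avro sink
    val flumeStream = FlumeUtils.createStream(ssc, "receiver-node.example.com", 41414)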

Is it possible to call a transform + action inside an action?

2014-10-28 Thread kpeng1
I am currently writing an application that uses spark streaming. What I am trying to do is basically read in a few files (I do this by using the spark context textFile) and then process those files inside an action that I apply to a streaming RDD. Here is the main code below: def main(args:

How to properly debug spark streaming?

2014-10-28 Thread kpeng1
I am still fairly new to spark and spark streaming. I have been struggling with how to properly debug spark streaming and I was wondering what is the best approach. I have been basically putting println statements everywhere, but sometimes they show up when I run the job and sometimes they

RE: Is it possible to call a transform + action inside an action?

2014-10-28 Thread kpeng1
Ok cool. So in that case the only way I could think of doing this would be calling the toArray method on those RDDs, which would return Array[String], and storing them as broadcast variables. I read about broadcast variables, but it is still fuzzy. I assume that since broadcast variables are
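A minimal sketch of the collect-and-broadcast approach described here; the RDD and stream names are hypothetical, and the collected data has to fit comfortably in driver and executor memory:

    // Driver side: materialize the small RDD and broadcast the array
    val whitelist: Array[String] = whitelistRdd.collect()   // collect() supersedes the deprecated toArray
    val whitelistBc = sc.broadcast(whitelist)

    // Inside streaming transformations, read the broadcast value instead of touching another RDD
    val filtered = stream.filter(record => whitelistBc.value.contains(record))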

Invalid signature file digest for Manifest main attributes with spark job built using maven

2014-09-15 Thread kpeng1
Hi All, I am trying to submit a spark job that I have built in maven using the following command: /usr/bin/spark-submit --deploy-mode client --class com.spark.TheMain --master local[1] /home/cloudera/myjar.jar 100 But I seem to be getting the following error: Exception in thread main

Spark Streaming into HBase

2014-09-03 Thread kpeng1
I have been trying to understand how spark streaming and hbase connect, but have not been successful. What I am trying to do is given a spark stream, process that stream and store the results in an hbase table. So far this is what I have: import org.apache.spark.SparkConf import
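A hedged sketch of the usual pattern for writing a stream into HBase with the 0.98-era client API: open the table once per partition inside foreachRDD, not once per record and not on the driver. The table name, column family, and the (key, value) shape of the stream are hypothetical:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HTable, Put}
    import org.apache.hadoop.hbase.util.Bytes

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val conf = HBaseConfiguration.create()          // picks up hbase-site.xml from the classpath
        val table = new HTable(conf, "stream_events")   // hypothetical table name
        records.foreach { case (rowKey, value) =>
          val put = new Put(Bytes.toBytes(rowKey))
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(value))
          table.put(put)
        }
        table.close()
      }
    }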

Re: Spark Streaming into HBase

2014-09-03 Thread kpeng1
in the classpath ? Do you observe any exception from the code below or in region server log ? Which hbase release are you using ? On Wed, Sep 3, 2014 at 2:05 PM, kpeng1 wrote: I have been trying to understand how spark streaming

Re: Issue Connecting to HBase in spark shell

2014-08-27 Thread kpeng1
It looks like the issue I had is that I didn't pull the htrace-core jar into the spark classpath.