Re: How to avoid duplicate column names after join with multiple conditions

2018-07-12 Thread Prem Sure
Both are the same, just pick one. On Thu, Jul 12, 2018 at 9:38 AM, Prem Sure wrote: Hi Nirav, did you try .drop(df1("a")) after the join? Thanks, Prem. On Thu, Jul 12, 2018 at 9:50 PM Nirav Patel wrote: Hi

Re: How to avoid duplicate column names after join with multiple conditions

2018-07-12 Thread Prem Sure
Hi Nirav, did you try .drop(df1("a")) after the join? Thanks, Prem. On Thu, Jul 12, 2018 at 9:50 PM Nirav Patel wrote: Hi Vamshi, that API is very restricted and not generic enough. It imposes that all join conditions have to use the same column name on both sides, and it also has to be an equi-join. It
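A minimal sketch of the .drop approach, assuming two illustrative frames df1 and df2 that both carry join columns a and b (the names are placeholders, not from the thread):

    // join on multiple conditions, then drop one side's copy of each duplicated column
    val joined = df1.join(df2, df1("a") === df2("a") && df1("b") === df2("b"), "inner")
    val deduped = joined.drop(df1("a")).drop(df1("b"))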

Re: how to specify external jars in program with SparkConf

2018-07-12 Thread Prem Sure
I think the JVM is already initialized with its classpath by the time your conf code executes... I faced this earlier with Spark 1.6 and ended up moving to spark-submit with --jars; I found the classpath was not part of the runtime config changes. May I know what advantage you are trying to get by doing it programmatically? On
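For reference, a hedged example of the spark-submit form being suggested (class, jar names, and paths are placeholders):

    spark-submit --class com.example.MyApp \
      --jars /path/to/dep1.jar,/path/to/dep2.jar \
      my-app.jar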

Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-04 Thread Prem Sure
Try .pipe(<your .py script>) on the RDD. Thanks, Prem. On Wed, Jul 4, 2018 at 7:59 PM, Chetan Khatri wrote: Can someone please suggest, thanks. On Tue 3 Jul, 2018, 5:28 PM Chetan Khatri wrote: Hello Dear Spark User / Dev, I would like to pass a Python user-defined function to a Spark job
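A minimal sketch of the RDD.pipe suggestion, assuming a hypothetical script wordlen.py that reads lines on stdin and writes one output line per input; the script must be available on every executor (e.g. shipped with --files):

    val input = sc.parallelize(Seq("spark", "scala", "python"))
    // each partition's records are streamed to the external process via stdin/stdout
    val piped = input.pipe("python ./wordlen.py")
    piped.collect().foreach(println)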

Re: Inferring Data driven Spark parameters

2018-07-04 Thread Prem Sure
Can you share the API that your jobs use: just core RDDs, or SQL, or DStreams, etc.? Refer to the recommendations at https://spark.apache.org/docs/2.3.0/configuration.html for detailed configurations. Thanks, Prem. On Wed, Jul 4, 2018 at 12:34 PM, Aakash Basu wrote: I do not want to change
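For illustration only, a hedged sketch of setting a few commonly tuned values from that configuration page programmatically (the numbers are placeholders, not recommendations):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("tuned-job")
      .set("spark.executor.memory", "4g")           // placeholder value
      .set("spark.executor.cores", "2")             // placeholder value
      .set("spark.sql.shuffle.partitions", "200")   // default; raise for larger shuffles
    val sc = new SparkContext(conf)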

Re: [Spark Streaming MEMORY_ONLY] Understanding Dataflow

2018-07-04 Thread Prem Sure
Hoping the below helps clear some of this up: executors do not have a way to share data among themselves, except for accumulators, which are shared with the driver's support. Based on data locality (local or remote), tasks and stages are planned, and that planning may result in a shuffle. On Wed, Jul 4, 2018 at
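A minimal sketch of the accumulator point, assuming the Spark 2.x longAccumulator API: executors can only add to it, and the merged value is read back on the driver.

    val badRecords = sc.longAccumulator("badRecords")
    sc.parallelize(1 to 100).foreach { n =>
      if (n % 7 == 0) badRecords.add(1)   // executors only add; they never read each other's values
    }
    println(badRecords.value)             // merged value is visible on the driver after the action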

Re: How to set spark.driver.memory?

2018-06-19 Thread Prem Sure
Hi, can you share the exception? You need to give the value as well, right after --driver-memory. First preference goes to the config key/value pairs defined in spark-submit, and only then to spark-defaults.conf. You can refer to the docs for the exact variable name. Thanks, Prem. On Tue, Jun 19, 2018 at 5:47
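For reference, a hedged example of giving the value right after the flag (class, value, and jar name are placeholders):

    spark-submit --class com.example.MyApp \
      --driver-memory 4g \
      my-app.jar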

Re: Spark application fail wit numRecords error

2017-11-01 Thread Prem Sure
Hi, is there an offset left over for the new topic's consumption? One possible case is that the stored offset is beyond the current latest offset, causing a negative count. Also, hoping the Kafka brokers are healthy and up; that can sometimes be a reason as well. On Wed, Nov 1, 2017 at 11:40 AM, Serkan TAS wrote:
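If stale offsets are the cause, one option is to control where a fresh consumer group starts; a sketch of the relevant consumer settings, assuming the kafka-0-10 integration (broker and group names are placeholders):

    import org.apache.kafka.common.serialization.StringDeserializer

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",             // placeholder
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "my-consumer-group",        // placeholder
      "auto.offset.reset"  -> "latest"                    // start from the latest offset
    )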

Re: Error using collectAsMap() in scala

2016-03-20 Thread Prem Sure
Any specific reason you would like to use collectAsMap in particular? You could probably move to a normal RDD instead of a pair RDD. On Monday, March 21, 2016, Mark Hamstra wrote: You're not getting what Ted is telling you. Your `dict` is an RDD[String] -- i.e. it is a collection of
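For context, collectAsMap is only defined on RDDs of key/value pairs; a minimal sketch of the difference (the sample data is made up):

    val words = sc.parallelize(Seq("a", "bb", "ccc"))   // RDD[String]: no collectAsMap here
    val pairs = words.map(w => (w, w.length))           // RDD[(String, Int)]: a pair RDD
    val asMap = pairs.collectAsMap()                    // Map("a" -> 1, "bb" -> 2, "ccc" -> 3)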

Re: Spark Certification

2016-02-11 Thread Prem Sure
I did recently. It includes MLlib and GraphX too, and I felt the exam content covered all topics up to Spark 1.3, not the post-1.3 versions. On Thu, Feb 11, 2016 at 9:39 AM, Janardhan Karri wrote: I am planning to do that with Databricks

Re: How to view the RDD data based on Partition

2016-01-12 Thread Prem Sure
Try mapPartitionsWithIndex.. below is an example I used earlier; the myfunc logic can be further modified as per your need.

    val x = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)
    def myfunc(index: Int, iter: Iterator[Int]): Iterator[String] = {
      iter.toList.map(x => index + "," + x).iterator
    }
    x.mapPartitionsWithIndex(myfunc).collect()  // each element tagged with its partition index

Re: How to view the RDD data based on Partition

2016-01-12 Thread Prem Sure
"345","")  ("","345") => "0" -- resulting min length is 0. ("0","") => "0" -- min length becomes zero again. Final merge: ("1","0") => "10". Hope this helps. On Tue, Jan 12, 2016 at 2:53
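The walkthrough above appears to trace the classic two-partition aggregate example; a sketch of what the call might look like (the exact input list and partitioning are assumptions, since the message is truncated):

    // assumed input split into two partitions, e.g. ("12","23") and ("345","")
    val z = sc.parallelize(List("12", "23", "345", ""), 2)
    z.aggregate("")(
      (acc, s) => math.min(acc.length, s.length).toString,  // per-partition: keep min length as a string
      (a, b) => a + b                                        // merge: concatenate partition results
    )
    // partition 1: ("","12") => "0", ("0","23") => "1"; partition 2: ("","345") => "0", ("0","") => "0"
    // final merge: "1" + "0" => "10" (order may vary)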

Re: Create a n x n graph given only the vertices no

2016-01-10 Thread Prem Sure
You mean without edge data? I don't think so. The other way around is possible, by calling fromEdges on Graph (this assigns the vertices mentioned by the edges a default value). Please share your need/requirement in detail if possible. On Sun, Jan 10, 2016 at 10:19 PM, praveen S
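A minimal sketch of the Graph.fromEdges route (the edge list and the default vertex attribute are made-up placeholders):

    import org.apache.spark.graphx.{Edge, Graph}

    // the edges imply the vertex ids; each vertex gets the supplied default attribute (0 here)
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "ab"), Edge(2L, 3L, "bc")))
    val graph = Graph.fromEdges(edges, 0)
    graph.vertices.collect().foreach(println)   // (1,0), (2,0), (3,0) in some order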

Re: Spark job uses only one Worker

2016-01-08 Thread Prem Sure
To narrow it down, you can try the below: 1) Is the job going to the same node every time (when you execute the job multiple times)? Enable the spark.speculation property, keep a Thread.sleep for 2 minutes, and see if the job goes to a different worker than the executor it initially landed on (trying to find whether there are

Re: adding jars - hive on spark cdh 5.4.3

2016-01-07 Thread Prem Sure
Did you try the --jars property in spark-submit? If your jar is of huge size, you can pre-load the jar on all executors in a commonly available directory to avoid network IO. On Thu, Jan 7, 2016 at 4:03 PM, Ophir Etzion wrote: I'm trying to add jars before running a query

Re: Problems with reading data from parquet files in a HDFS remotely

2016-01-07 Thread Prem Sure
You may need to add a createDataFrame (for Python, inferSchema) call before registerTempTable. Thanks, Prem. On Thu, Jan 7, 2016 at 12:53 PM, Henrik Baastrup <henrik.baast...@netscout.com> wrote: Hi All, I have a small Hadoop cluster where I have stored a lot of data in parquet
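A minimal sketch in the Spark 1.x-era API the thread dates suggest, assuming a spark-shell style sqlContext (the HDFS path and table name are placeholders):

    val df = sqlContext.read.parquet("hdfs://namenode:8020/data/parquet")   // placeholder path
    df.registerTempTable("events")                                          // placeholder table name
    sqlContext.sql("SELECT COUNT(*) FROM events").show()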

Re: Question in rdd caching in memory using persist

2016-01-07 Thread Prem Sure
Are you running standalone in local mode or cluster mode? Where the executor and driver live differs based on the setup type. A snapshot of your environment UI would help to say more. On Thu, Jan 7, 2016 at 11:51 AM, wrote: Hi, after I called rdd.persist(MEMORY_ONLY_SER), I

Re: sparkR ORC support.

2016-01-05 Thread Prem Sure
Yes Sandeep, also copy hive-site.xml to the Spark conf directory. On Tue, Jan 5, 2016 at 10:07 AM, Sandeep Khurana wrote: Also, do I need to set up Hive in Spark as per the link http://stackoverflow.com/questions/26360725/accesing-hive-tables-in-spark ? We might

Exception in thread "main" java.lang.IncompatibleClassChangeError:

2015-12-04 Thread Prem Sure
Getting the below exception while executing the program below in Eclipse. Any clue on what's wrong here would be helpful.

    public class WordCount {
      private static final FlatMapFunction WORDS_EXTRACTOR =
        new FlatMapFunction() {
          @Override
          public Iterable

Re: Automatic driver restart does not seem to be working in Spark Standalone

2015-11-25 Thread Prem Sure
I think automatic driver restart will happen if the driver fails with a non-zero exit code and the job was submitted with --deploy-mode cluster --supervise. On Wed, Nov 25, 2015 at 1:46 PM, SRK wrote: Hi, I am submitting my Spark job with the supervise option as shown below. When I kill the
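For reference, a hedged example of the submit form this refers to (master URL, class, and jar are placeholders):

    spark-submit --master spark://master-host:7077 \
      --deploy-mode cluster \
      --supervise \
      --class com.example.MyApp \
      my-app.jar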

Re: Queue in Spark standalone mode

2015-11-25 Thread Prem Sure
In Spark standalone mode, submitted applications run in FIFO (first-in, first-out) order. Please elaborate on the "strange behavior while running multiple jobs simultaneously." On Wed, Nov 25, 2015 at 2:29 PM, sunil m <260885smanik...@gmail.com> wrote: Hi! I am using Spark 1.5.1 and pretty new

Re: Spark 1.6 Build

2015-11-24 Thread Prem Sure
You can refer to: https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/building-spark.html#building-with-buildmvn On Tue, Nov 24, 2015 at 7:16 AM, Madabhattula Rajesh Kumar <mrajaf...@gmail.com> wrote: Hi, I'm not able to build Spark 1.6 from source. Could you please
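For example, the build/mvn route from that page looks roughly like this (the profile and Hadoop version are illustrative, not prescriptive):

    ./build/mvn -DskipTests clean package
    ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package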