Re: Spark yarn cluster

2020-07-11 Thread Diwakar Dhanuskodi
need to use other components of the Hadoop cluster, namely MapReduce and HDFS. That being said, if you just need cluster scheduling and are not using MapReduce or HDFS, it is possible you will be fine with the Spark Standalone cluster. Regards, Juan Martín
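For the standalone route, a minimal sketch of bringing up a cluster (paths assume a stock Spark distribution under $SPARK_HOME; host names are placeholders):

```shell
# Start the master (its log prints the spark://<host>:7077 URL)
$SPARK_HOME/sbin/start-master.sh
# Start one worker per node, pointing at the master URL
$SPARK_HOME/sbin/start-slave.sh spark://<master-host>:7077
# Submit against the standalone master instead of YARN
$SPARK_HOME/bin/spark-submit --master spark://<master-host>:7077 --class MyApp app.jar
```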

Spark yarn cluster

2020-07-11 Thread Diwakar Dhanuskodi
Hi, would it be possible to set up Spark within a YARN cluster which may not have Hadoop? Thanks.

Foreachpartition in spark streaming

2017-03-20 Thread Diwakar Dhanuskodi
Just wanted to clarify: is foreachPartition in Spark an output operation? Which is better to use, mapPartitions or foreachPartition? Regards, Diwakar
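A minimal sketch of the distinction, assuming a live SparkContext `sc` (Spark 1.x/2.x RDD API): foreachPartition is an action (output operation), while mapPartitions is a lazy transformation.

```scala
val rdd = sc.parallelize(1 to 100, 4)

// Transformation: returns a new RDD; nothing runs until an action is called.
val doubled = rdd.mapPartitions(iter => iter.map(_ * 2))

// Action / output operation: runs immediately and returns Unit - use it for
// side effects such as writing each partition to an external store.
doubled.foreachPartition { iter =>
  iter.foreach(println)   // e.g. replace with one DB/sink connection per partition
}
```

So use mapPartitions when you need the transformed data back as an RDD, and foreachPartition when the partition is only consumed for its side effects.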

Re: Pls assist: Spark 2.0 build failure on Ubuntu 16.06

2016-09-03 Thread Diwakar Dhanuskodi
Please run with -X and post the logs here; we can get the exact error from them. On Sat, Sep 3, 2016 at 7:24 PM, Marco Mistroni wrote: > hi all > > i am getting failures when building spark 2.0 on Ubuntu 16.06 > Here's details of what i have installed on the ubuntu host > - java 8

Hive connection issues in spark-shell

2016-09-03 Thread Diwakar Dhanuskodi
Hi, I recently built Spark using Maven. Now when starting spark-shell, it can't connect to Hive and I am getting the below error. I couldn't find the datanucleus jar in the built library, but the datanucleus jar is available in the hive/lib folder. java.lang.ClassNotFoundException:
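One hedged workaround, assuming the Hive-provided jars are compatible with the build: pass the datanucleus jars from hive/lib explicitly on the spark-shell classpath (the paths and version numbers below are illustrative and will differ per installation):

```shell
spark-shell --jars /usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar,\
/usr/lib/hive/lib/datanucleus-core-3.2.10.jar,\
/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar
```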

Re: Spark build 1.6.2 error

2016-09-03 Thread Diwakar Dhanuskodi
a > > On Sat 3 Sep, 2016, 12:14 PM Diwakar Dhanuskodi, < > diwakar.dhanusk...@gmail.com> wrote: > >> Hi, >> >> Just re-ran again without killing zinc server process >> >> /make-distribution.sh --name custom-spark --tgz -Phadoop-2.6 -Phive >>

Re: Spark build 1.6.2 error

2016-09-03 Thread Diwakar Dhanuskodi
:44 AM, Diwakar Dhanuskodi < diwakar.dhanusk...@gmail.com> wrote: > Hi, > > java version 7 > > mvn command > ./make-distribution.sh --name custom-spark --tgz -Phadoop-2.6 -Phive > -Phive-thriftserver -Pyarn -Dmaven.version=3.0.4 > > > yes, I executed script to c

Re: Spark build 1.6.2 error

2016-09-03 Thread Diwakar Dhanuskodi
park-sql_2.11 On Thu, Sep 1, 2016 at 8:23 AM, Divya Gehlot <divya.htco...@gmail.com> wrote: > Which java version are you using ? > > On 31 August 2016 at 04:30, Diwakar Dhanuskodi < > diwakar.dhanusk...@gmail.com> wrote: > >> Hi, >> >> While bui

Spark build 1.6.2 error

2016-08-30 Thread Diwakar Dhanuskodi
Hi, While building Spark 1.6.2, I am getting the below error in spark-sql. Any help is much appreciated. [ERROR] missing or invalid dependency detected while loading class file 'WebUI.class'. Could not access term eclipse in package org, because it (or its dependencies) are missing. Check your build

Re: Best way to read XML data from RDD

2016-08-22 Thread Diwakar Dhanuskodi
, Aug 22, 2016 at 4:34 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote: > Do you mind share your codes and sample data? It should be okay with > single XML if I remember this correctly. > > 2016-08-22 19:53 GMT+09:00 Diwakar Dhanuskodi < > diwakar.dhanusk...@gmail.com>: >

Re: Best way to read XML data from RDD

2016-08-22 Thread Diwakar Dhanuskodi
.@gmail.com> Cc: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>, Felix Cheung <felixcheun...@hotmail.com>, user <user@spark.apache.org> Subject: Re: Best way to read XML data from RDD Another option would be to look at spark-xml-utils. We use this extensively in the man

Re: Best way to read XML data from RDD

2016-08-22 Thread Diwakar Dhanuskodi
(GMT+05:30) To: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com> Cc: Felix Cheung <felixcheun...@hotmail.com>, user <user@spark.apache.org> Subject: Re: Best way to read XML data from RDD I fear the issue is that this will create and destroy a XML parser object 2

Re: Best way to read XML data from RDD

2016-08-22 Thread Diwakar Dhanuskodi
: Jörn Franke <jornfra...@gmail.com> Cc: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>, Felix Cheung <felixcheun...@hotmail.com>, user <user@spark.apache.org> Subject: Re: Best way to read XML data from RDD Hi Diwakar, Spark XML library can take RDD as sourc

Re: Best way to read XML data from RDD

2016-08-19 Thread Diwakar Dhanuskodi
to parse XML from 2 million messages in a 3 nodes 100G 4 cpu each environment.  Sent from Samsung Mobile. Original message From: Felix Cheung <felixcheun...@hotmail.com> Date:20/08/2016 09:49 (GMT+05:30) To: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>,

Best way to read XML data from RDD

2016-08-19 Thread Diwakar Dhanuskodi
Hi, There is an RDD with JSON data. I could read the JSON data using rdd.read.json. The JSON data has XML data in a couple of key-value pairs. Which is the best method to read and parse the XML from the RDD? Are there any specific XML libraries for Spark? Could anyone help on this. Thanks.
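One approach (a sketch, not the solution the thread settled on) is to parse the embedded XML with Scala's built-in parser inside mapPartitions; the column name "payload" and the "id" element are hypothetical stand-ins for the real field names. The thread also discusses the spark-xml and spark-xml-utils libraries as alternatives.

```scala
import scala.xml.XML

// rdd: RDD[String] of JSON records; one string field holds an XML document.
val df = sqlContext.read.json(rdd)
val parsed = df.select("payload").rdd.mapPartitions { iter =>
  iter.map { row =>
    val doc = XML.loadString(row.getString(0))   // parse the embedded XML once per record
    (doc \\ "id").text                           // extract a sample element
  }
}
```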

Spark streaming

2016-08-18 Thread Diwakar Dhanuskodi
Hi, Is there a way to specify in createDirectStream to receive only the last 'n' offsets of a specific topic and partition? I don't want to filter them out in foreachRDD. Sent from Samsung Mobile.
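To my knowledge there is no built-in "last n" option, but the Spark 1.x direct API accepts explicit starting offsets, so you can compute latest-minus-n per partition yourself. A sketch, leaving the offset lookup unimplemented since it depends on your Kafka client (e.g. the OffsetRequest API):

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val n = 100L
// Fetch the latest offset per partition via the Kafka offset API (not shown).
val latest: Map[TopicAndPartition, Long] = ???
val fromOffsets = latest.mapValues(off => math.max(0L, off - n)).map(identity)

val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mm: MessageAndMetadata[String, String]) => (mm.key, mm.message))
```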

createDirectStream parallelism

2016-08-18 Thread Diwakar Dhanuskodi
We are using the createDirectStream API to receive messages from a 48-partition topic. I am setting --num-executors 48 & --executor-cores 1 in spark-submit. All partitions were received in parallel and the corresponding RDDs in foreachRDD were executed in parallel. But when I join a transformed RDD

RE: [Spark 2.0] ClassNotFoundException is thrown when using Hive

2016-08-18 Thread Diwakar Dhanuskodi
Hi, Can you cross-check by providing the same library path in --jars of spark-submit and running? Sent from Samsung Mobile. Original message From: "颜发才(Yan Facai)" Date:18/08/2016 15:17 (GMT+05:30) To: "user.spark" Cc: Subject:

Re: KafkaUtils.createStream not picking smallest offset

2016-08-13 Thread Diwakar Dhanuskodi
From: Cody Koeninger <c...@koeninger.org> Date:12/08/2016 23:42 (GMT+05:30) To: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>, user@spark.apache.org Cc: Subject: Re: KafkaUtils.createStream not picking smallest offset Are you checkpointing? Beyond that, why are you using

KafkaUtils.createStream not picking smallest offset

2016-08-12 Thread Diwakar Dhanuskodi
Hi, We are using Spark 1.6.1 and Kafka 0.9. KafkaUtils.createStream is showing strange behaviour even though auto.offset.reset is set to smallest: whenever we need to restart the stream, it picks up the latest offset, which is not expected. Do we need to set any
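A likely explanation, offered as a sketch rather than a confirmed diagnosis: with the receiver-based createStream, the high-level consumer commits offsets to ZooKeeper under the consumer group, and auto.offset.reset only applies when the group has no committed offset. On restart the stream resumes from the committed offset, not from smallest. Using a fresh group.id (or the direct stream API) forces the reset to apply:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "zookeeper.connect" -> "zk:2181",
  "group.id"          -> "my-new-group",   // fresh group -> no committed offset -> reset applies
  "auto.offset.reset" -> "smallest")

val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map("topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)
```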

Re: Spark streaming not processing messages from partitioned topics

2016-08-11 Thread Diwakar Dhanuskodi
Figured it out. All I was doing wrong was testing it in a pseudo-node VM with 1 core; the tasks were waiting for CPU. In a production cluster this works just fine. On Thu, Aug 11, 2016 at 12:45 AM, Diwakar Dhanuskodi < diwakar.dhanusk...@gmail.com> wrote: > Checked executor lo

Re: Spark streaming not processing messages from partitioned topics

2016-08-10 Thread Diwakar Dhanuskodi
the kafka partitions? > If you use kafka-simple-consumer-shell.sh to read that partition, do > you get any data? > > On Wed, Aug 10, 2016 at 9:40 AM, Diwakar Dhanuskodi > <diwakar.dhanusk...@gmail.com> wrote: > > Hi Cody, > > > > Just added zookeeper.connect

Re: Spark streaming not processing messages from partitioned topics

2016-08-10 Thread Diwakar Dhanuskodi
Cody Koeninger <c...@koeninger.org> wrote: > Those logs you're posting are from right after your failure, they don't > include what actually went wrong when attempting to read json. Look at your > logs more carefully. > On Aug 10, 2016 2:07 AM, "Diwakar Dhanuskodi" <

Re: Spark streaming not processing messages from partitioned topics

2016-08-10 Thread Diwakar Dhanuskodi
Added jobs for time 147081258 ms 16/08/10 12:34:00 INFO JobScheduler: Added jobs for time 147081264 ms On Wed, Aug 10, 2016 at 10:26 AM, Diwakar Dhanuskodi < diwakar.dhanusk...@gmail.com> wrote: > Hi Siva, > > Does topic has partitions? which version of Spark you are usi

Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Diwakar Dhanuskodi
dd => { > if (rdd.isEmpty()) { > println("Failed to get data from Kafka. Please check that the Kafka > producer is streaming data.") > System.exit(-1) > } > val sqlContext = org.apache.spark.sql.SQLContext.getOrCreate(rdd. > sparkContext) > val wea

Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Diwakar Dhanuskodi
at a time to the working > rdd.foreach example and see when it stops working, then take a closer > look at the logs. > > > On Tue, Aug 9, 2016 at 10:20 PM, Diwakar Dhanuskodi > <diwakar.dhanusk...@gmail.com> wrote: > > Hi Cody, > > > > Without conditional .

Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Diwakar Dhanuskodi
ntext and just do > > rdd => { > rdd.foreach(println) > > > as a base line to see if you're reading the data you expect > > On Tue, Aug 9, 2016 at 3:47 PM, Diwakar Dhanuskodi > <diwakar.dhanusk...@gmail.com> wrote: > > Hi, > > > > I am reading jso

Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Diwakar Dhanuskodi
Hi, I am reading JSON messages from Kafka. The topic has 2 partitions. When running the streaming job using spark-submit, I can see that *val dataFrame = sqlContext.read.json(rdd.map(_._2))* executes indefinitely. Am I doing something wrong here? Below is the code. This environment is a Cloudera sandbox
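A sketch of the pattern being described (Spark 1.x direct stream of (key, value) pairs): each micro-batch of JSON strings is converted to a DataFrame inside foreachRDD. Note that read.json triggers a schema-inference pass over the batch, and empty batches should be skipped.

```scala
import org.apache.spark.sql.SQLContext

stream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    val df = sqlContext.read.json(rdd.map(_._2))   // _._2 is the Kafka message value
    df.show()                                      // or any downstream processing
  }
}
```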

Re: Spark streaming takes longer time to read json into dataframes

2016-07-19 Thread Diwakar Dhanuskodi
Mobile. Original message From: Cody Koeninger <c...@koeninger.org> Date:19/07/2016 20:49 (GMT+05:30) To: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com> Cc: Martin Eden <martineden...@gmail.com>, user <user@spark.apache.org> Subject: Re: Spark str

Re: Spark streaming takes longer time to read json into dataframes

2016-07-17 Thread Diwakar Dhanuskodi
. Original message From: Martin Eden <martineden...@gmail.com> Date:16/07/2016 14:01 (GMT+05:30) To: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com> Cc: user <user@spark.apache.org> Subject: Re: Spark streaming takes longer time to read json into dataframes Hi

Fwd: Spark streaming takes longer time to read json into dataframes

2016-07-15 Thread Diwakar Dhanuskodi
-- Forwarded message -- From: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com> Date: Sat, Jul 16, 2016 at 9:30 AM Subject: Re: Spark streaming takes longer time to read json into dataframes To: Jean Georges Perrin <j...@jgp.net> Hello, I need it on memory. Increa

Spark streaming takes longer time to read json into dataframes

2016-07-15 Thread Diwakar Dhanuskodi
Hello, I have 400K JSON messages pulled from Kafka into Spark Streaming using the DirectStream approach. The size of the 400K messages is around 5G. The Kafka topic is single-partitioned. I am using rdd.read.json(_._2) inside foreachRDD to convert the RDD into a dataframe. It takes almost 2.3 minutes to convert into
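One common cause of this cost is schema inference: read.json scans the whole batch once to infer a schema before parsing it. A hedged sketch of supplying an explicit schema instead (the field names here are hypothetical), which removes the extra pass:

```scala
import org.apache.spark.sql.types._

// Declare the known JSON structure up front instead of inferring it per batch.
val schema = StructType(Seq(
  StructField("id",   StringType),
  StructField("body", StringType)))

stream.foreachRDD { rdd =>
  val df = sqlContext.read.schema(schema).json(rdd.map(_._2))
  df.count()   // stand-in for real downstream processing
}
```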

RE: Fwd: DF creation

2016-03-19 Thread Diwakar Dhanuskodi
Import sqlContext.implicits._ before using df(). Sent from Samsung Mobile. Original message From: satyajit vegesna Date:19/03/2016 06:00 (GMT+05:30) To: user@spark.apache.org, d...@spark.apache.org Cc: Subject: Fwd: DF creation Hi , I am
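A minimal sketch of the advice above (Spark 1.x): the toDF() conversion on an RDD of case classes is provided by the implicits import, so the import must precede the call.

```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._   // brings toDF() into scope

case class Person(name: String, age: Int)
val df = sc.parallelize(Seq(Person("a", 1))).toDF()
```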

Re: Spark Submit

2016-02-12 Thread Diwakar Dhanuskodi
Try spark-submit --conf "spark.executor.memory=512m" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.xml" Sent from Samsung Mobile. Original message From: Ted Yu Date:12/02/2016 21:24 (GMT+05:30) To: Ashish Soni

RE: HADOOP_HOME are not set when try to run spark application in yarn cluster mode

2016-02-09 Thread Diwakar Dhanuskodi
loading resource file:/usr/lib/hadoop/etc/hadoop/yarn-site.xml -> hdfs://quickstart.cloudera:8020/user/cloudera/.sparkStaging/application_1455041341343_0002/yarn-site.xml From: Diwakar Dhanuskodi [mailto:diwakar.dhanusk...@gmail.com] Sent: Tuesday, February 09, 2016 10:00 AM To:

Re: AM creation in yarn client mode

2016-02-09 Thread Diwakar Dhanuskodi
Your 2nd assumption is correct. There is a YARN client which polls the AM while running in yarn-client mode. Sent from Samsung Mobile. Original message From: ayan guha Date:10/02/2016 10:55 (GMT+05:30) To: praveen S Cc: user

RE: HADOOP_HOME are not set when try to run spark application in yarn cluster mode

2016-02-09 Thread Diwakar Dhanuskodi
Pass all the Hadoop conf files to spark-submit via the --files parameter. Sent from Samsung Mobile. Original message From: Rachana Srivastava Date:09/02/2016 22:53 (GMT+05:30) To: user@spark.apache.org Cc: Subject: HADOOP_HOME are not
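A sketch of that invocation, assuming the usual Cloudera-style conf location (the paths and class/jar names are placeholders):

```shell
spark-submit --master yarn --deploy-mode cluster \
  --files /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml,/etc/hadoop/conf/yarn-site.xml \
  --class MyApp app.jar
```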

Re: Kafka directsream receiving rate

2016-02-08 Thread Diwakar Dhanuskodi
Samsung Mobile. Original message From: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com> Date:07/02/2016 01:39 (GMT+05:30) To: Cody Koeninger <c...@koeninger.org> Cc: user@spark.apache.org Subject: Re: Kafka directsream receiving rate Thanks Cody for trying to

Re: Apache Spark data locality when integrating with Kafka

2016-02-07 Thread Diwakar Dhanuskodi
Fanoos, where do you want the solution to be deployed? On premise or cloud? Regards, Diwakar. Sent from Samsung Mobile. Original message From: "Yuval.Itzchakov" Date:07/02/2016 19:38 (GMT+05:30) To: user@spark.apache.org Cc: Subject: Re: Apache

Re: Apache Spark data locality when integrating with Kafka

2016-02-07 Thread Diwakar Dhanuskodi
الليثي <dev.fano...@gmail.com> Date:08/02/2016 02:07 (GMT+05:30) To: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com> Cc: "Yuval.Itzchakov" <yuva...@gmail.com>, user <user@spark.apache.org> Subject: Re: Apache Spark data locality when integrating with Kafka Diw

Re: Kafka directsream receiving rate

2016-02-06 Thread Diwakar Dhanuskodi
Again, why only 150K? Any clarification on directStream processing millions per batch is much appreciated. Sent from Samsung Mobile. Original message From: Cody Koeninger <c...@koeninger.org> Date:06/02/2016 01:30 (GMT+05:30) To: Diwakar Dhanuskodi <diwakar.

RE: Apache Spark data locality when integrating with Kafka

2016-02-06 Thread Diwakar Dhanuskodi
Yes, to reduce network latency. Sent from Samsung Mobile. Original message From: fanooos Date:07/02/2016 09:24 (GMT+05:30) To: user@spark.apache.org Cc: Subject: Apache Spark data locality when integrating with Kafka Dears If I will use

Re: Kafka directsream receiving rate

2016-02-06 Thread Diwakar Dhanuskodi
Mobile. Original message From: Cody Koeninger <c...@koeninger.org> Date:06/02/2016 01:30 (GMT+05:30) To: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com> Cc: user@spark.apache.org Subject: Re: Kafka directsream receiving rate Have you tried just printing each message, to see w

Re: Kafka directsream receiving rate

2016-02-05 Thread Diwakar Dhanuskodi
er.jar /root/Jars/sparkreceiver.jar Sent from Samsung Mobile. Original message From: Cody Koeninger <c...@koeninger.org> Date:05/02/2016 22:07 (GMT+05:30) To: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com> Cc: user@spark.apache.org Subject: Re: Kaf

Re: Kafka directsream receiving rate

2016-02-05 Thread Diwakar Dhanuskodi
<c...@koeninger.org> Date:06/02/2016 00:33 (GMT+05:30) To: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com> Cc: user@spark.apache.org Subject: Re: Kafka directsream receiving rate How are you counting the number of messages? I'd go ahead and remove the settings for

RE: Kafka directsream receiving rate

2016-02-04 Thread Diwakar Dhanuskodi
message From: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com> Date:05/02/2016 07:33 (GMT+05:30) To: user@spark.apache.org Cc: Subject: Kafka directsream receiving rate Hi, Using spark 1.5.1. I have a topic with 20 partitions. When I publish 100 messages. Spark direct

kafkaDirectStream usage error

2016-02-04 Thread Diwakar Dhanuskodi
.unsafe.enabled=false" --conf "spark.streaming.backpressure.enabled=true" --conf "spark.locality.wait=1s" --conf "spark.shuffle.consolidateFiles=true"   --driver-memory 2g --executor-memory 1g --class com.tcs.dime.spark.SparkReceiver   --files /etc/hadoop/