Again, why only 150K?
Any clarification on directStream processing millions of messages per batch
is much appreciated.
From: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com> Subject: Re: Kafka directstream receiving rate
Thanks Cody for trying to ...
Yes, to reduce network latency.
From: fanooos Subject: Apache Spark data locality when integrating with Kafka
Dears,
If I will use ...
From: Cody Koeninger <c...@koeninger.org> Subject: Re: Kafka directstream receiving rate
Have you tried just printing each message, to see w...
Fanoos,
Where do you want the solution to be deployed? On premise or cloud?
Regards,
Diwakar
Original message From: "Yuval.Itzchakov"
Date:07/02/2016 19:38 (GMT+05:30)
To: user@spark.apache.org Cc: Subject: Re:
Apache
er.jar
/root/Jars/sparkreceiver.jar
Sent from Samsung Mobile.
From: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com> Subject: Kafka directstream receiving rate
Hi,
Using Spark 1.5.1.
I have a topic with 20 partitions. When I publish 100 messages, Spark direct ...
.unsafe.enabled=false" --conf
"spark.streaming.backpressure.enabled=true" --conf "spark.locality.wait=1s"
--conf "spark.shuffle.consolidateFiles=true" --driver-memory 2g
--executor-memory 1g --class com.tcs.dime.spark.SparkReceiver --files
/etc/hadoop/
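Worth noting: with spark.streaming.backpressure.enabled=true, Spark deliberately limits how many records the direct stream pulls per batch until it has processing-rate history, which can cap the receive rate and would be consistent with seeing only ~150K per batch. A minimal sketch of the relevant settings (values are illustrative assumptions, not the poster's actual config):

    import org.apache.spark.SparkConf

    // Backpressure adapts per-batch intake to observed processing times;
    // maxRatePerPartition puts a hard ceiling on each Kafka partition.
    val conf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")
      .set("spark.streaming.kafka.maxRatePerPartition", "10000") // records/sec per partition, illustrative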
From: Cody Koeninger <c...@koeninger.org> Subject: Re: Kafka directstream receiving rate
How are you counting the number of messages?
I'd go ahead and remove the settings for ...
Try
spark-submit --conf "spark.executor.memory=512m" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.xml"
From: Ted Yu To: Ashish Soni
loading resource
file:/usr/lib/hadoop/etc/hadoop/yarn-site.xml ->
hdfs://quickstart.cloudera:8020/user/cloudera/.sparkStaging/application_1455041341343_0002/yarn-site.xml
From: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>
Your 2nd assumption is correct.
There is a YARN client which polls the AM while running in yarn-client mode.
From: ayan guha To: praveen S Cc: user
Pass all Hadoop conf files as spark-submit parameters via --files.
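For example, something like this (paths, class and jar names are illustrative, not from the original mail):

    spark-submit --master yarn \
      --files /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml,/etc/hadoop/conf/yarn-site.xml \
      --class com.example.MyApp myapp.jar

--files takes a comma-separated list and ships each file to the working directory of every executor.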
From: Rachana Srivastava Subject: HADOOP_HOME are not ...
import sqlContext.implicits._ before using df().
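A minimal sketch of why the import matters (Spark 1.x; the data is illustrative): toDF and the $"col" syntax come from implicits defined on the SQLContext instance, so the import has to happen after that instance exists.

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc) // sc: an existing SparkContext
    import sqlContext.implicits._       // brings toDF, $"col", etc. into scope

    val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "value")
    df.show()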
From: satyajit vegesna To: user@spark.apache.org, d...@spark.apache.org Subject: Fwd: DF creation
Hi,
I am ...
Cody Koeninger <c...@koeninger.org> wrote:
> Those logs you're posting are from right after your failure, they don't
> include what actually went wrong when attempting to read json. Look at your
> logs more carefully.
> On Aug 10, 2016 2:07 AM, "Diwakar Dhanuskodi" <
Figured it out. All I was doing wrong was testing it in a pseudo-node VM
with 1 core; the tasks were waiting for CPU.
In the production cluster this works just fine.
On Thu, Aug 11, 2016 at 12:45 AM, Diwakar Dhanuskodi <
diwakar.dhanusk...@gmail.com> wrote:
> Checked executor lo...
the kafka partitions?
> If you use kafka-simple-consumer-shell.sh to read that partition, do
> you get any data?
>
> On Wed, Aug 10, 2016 at 9:40 AM, Diwakar Dhanuskodi
> <diwakar.dhanusk...@gmail.com> wrote:
> > Hi Cody,
> >
> > Just added zookeeper.connect
ntext and just do
>
> rdd => {
>   rdd.foreach(println)
> }
>
> as a baseline to see if you're reading the data you expect
>
> On Tue, Aug 9, 2016 at 3:47 PM, Diwakar Dhanuskodi
> <diwakar.dhanusk...@gmail.com> wrote:
> > Hi,
> >
> > I am reading jso
rdd => {
>   if (rdd.isEmpty()) {
>     println("Failed to get data from Kafka. Please check that the Kafka producer is streaming data.")
>     System.exit(-1)
>   }
>   val sqlContext = org.apache.spark.sql.SQLContext.getOrCreate(rdd.sparkContext)
>   val wea...
at a time to the working
> rdd.foreach example and see when it stops working, then take a closer
> look at the logs.
>
>
> On Tue, Aug 9, 2016 at 10:20 PM, Diwakar Dhanuskodi
> <diwakar.dhanusk...@gmail.com> wrote:
> > Hi Cody,
> >
> > Without the conditional ...
Added jobs for time 147081258 ms
16/08/10 12:34:00 INFO JobScheduler: Added jobs for time 147081264 ms
On Wed, Aug 10, 2016 at 10:26 AM, Diwakar Dhanuskodi <
diwakar.dhanusk...@gmail.com> wrote:
> Hi Siva,
>
> Does the topic have partitions? Which version of Spark are you using?
Hi,
We are using Spark 1.6.1 and Kafka 0.9.
KafkaUtils.createStream is showing strange behaviour. Though
auto.offset.reset is set to smallest, whenever we restart the stream it
picks up the latest offset, which is not expected.
Do we need to set any ...
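One likely explanation, offered as an assumption rather than a diagnosis: with the receiver-based createStream (the old high-level consumer), auto.offset.reset only applies when the consumer group has no offsets committed in ZooKeeper; on restart with the same group.id the stream resumes from the committed offsets, which will be near the latest. A sketch of forcing a read from the beginning with a fresh group (hosts and topic are illustrative):

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map(
      "zookeeper.connect" -> "zkhost:2181",
      "group.id" -> "fresh-group-1", // a new group has no stored offsets, so auto.offset.reset applies
      "auto.offset.reset" -> "smallest")

    // ssc: an existing StreamingContext
    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("mytopic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)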
From: Cody Koeninger <c...@koeninger.org> Subject: Re: KafkaUtils.createStream not picking smallest offset
Are you checkpointing?
Beyond that, why are you using ...
From: Martin Eden <martineden...@gmail.com> Subject: Re: Spark streaming takes longer time to read json into dataframes
Hi ...
Hello,
I have 400K json messages pulled from Kafka into Spark streaming using the
DirectStream approach. The size of the 400K messages is around 5 GB. The
Kafka topic is single-partitioned. I am using sqlContext.read.json(rdd.map(_._2))
inside foreachRDD to convert each RDD into a dataframe. It takes almost
2.3 minutes to convert into ...
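A single-partition topic gives the direct stream a single-partition RDD, so the json parsing runs on one core. A hedged sketch of one common mitigation, repartitioning the raw strings before the expensive read (the stream variable and partition count are illustrative):

    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val sqlContext = org.apache.spark.sql.SQLContext.getOrCreate(rdd.sparkContext)
        // One Kafka partition => one Spark partition; spread the parse work first.
        val df = sqlContext.read.json(rdd.map(_._2).repartition(16)) // 16 is illustrative
        df.count() // placeholder action for the example
      }
    }

The repartition itself costs a shuffle of the raw strings, so whether it wins depends on message size versus parse cost.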
-- Forwarded message --
From: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>
Date: Sat, Jul 16, 2016 at 9:30 AM
Subject: Re: Spark streaming takes longer time to read json into dataframes
To: Jean Georges Perrin <j...@jgp.net>
Hello,
I need it in memory. Increa...
Hi,
I am reading json messages from Kafka. The topic has 2 partitions. When
running the streaming job using spark-submit, I could see that
val dataFrame = sqlContext.read.json(rdd.map(_._2)) executes indefinitely.
Am I doing something wrong here? Below is the code. This environment is a
Cloudera sandbox.
We are using the createDirectStream API to receive messages from a
48-partition topic. I am setting --num-executors 48 and --executor-cores 1
in spark-submit.
All partitions were received in parallel and the corresponding RDDs in
foreachRDD executed in parallel. But when I join a transformed RDD ...
Hi,
Is there a way to specify in createDirectStream to receive only the last
'n' offsets of a specific topic and partition? I don't want to filter them
out in foreachRDD.
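createDirectStream won't compute "latest minus n" for you, but there is an overload that takes explicit per-partition starting offsets, so you can look up the latest offset yourself and start n back. A hedged sketch (Spark 1.x API; topic, offsets, n and kafkaParams are illustrative):

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val n = 1000L
    val latest = 50000L // fetch this from the broker yourself, e.g. via the SimpleConsumer offset API
    val fromOffsets = Map(TopicAndPartition("mytopic", 0) -> math.max(0L, latest - n))

    // messageHandler decides what each record becomes; here a (key, value) pair
    val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets, messageHandler)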
Hi,
There is an RDD with json data. I could read the json data using
sqlContext.read.json.
The json data has XML data in a couple of key-value pairs.
Which is the best method to read and parse the XML from the RDD? Are there
any specific XML libraries for Spark? Could anyone help on this?
Thanks.
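One option that needs no extra library is Scala's built-in XML support, applied per record after pulling the XML strings out of the json. A hedged sketch (the source RDD and the extraction are illustrative):

    import scala.util.Try
    import scala.xml.XML

    // Assume xmlStrings: RDD[String] holding the XML extracted from the json,
    // e.g. df.select("payload").rdd.map(_.getString(0))
    val parsed = xmlStrings.flatMap { s =>
      Try {
        val root = XML.loadString(s)          // throws on malformed XML, hence the Try
        (root.label, (root \\ "item").length) // illustrative extraction
      }.toOption
    }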
... to parse XML from 2 million messages in a 3-node environment with 100 GB and 4 CPUs each.
On Aug 22, 2016 at 4:34 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> Do you mind sharing your code and sample data? It should be okay with a
> single XML if I remember this correctly.
>
> 2016-08-22 19:53 GMT+09:00 Diwakar Dhanuskodi <
> diwakar.dhanusk...@gmail.com>:
>
Subject: Re: Best way to read XML data from RDD
Another option would be to look at spark-xml-utils. We use this
extensively in the man...
Subject: Re: Best way to read XML data from RDD
I fear the issue is that this will create and destroy an XML parser object
2 million times ...
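A hedged sketch of the usual fix for that concern: instantiate one parser per partition with mapPartitions and reuse it across records, instead of one per message (plain JDK javax.xml here; the extraction is illustrative):

    import java.io.ByteArrayInputStream
    import javax.xml.parsers.DocumentBuilderFactory

    val rootTags = xmlStrings.mapPartitions { iter =>
      // One DocumentBuilder per partition, reused for every record in it.
      val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
      iter.map { s =>
        builder.reset() // a DocumentBuilder must be reset before reuse
        val doc = builder.parse(new ByteArrayInputStream(s.getBytes("UTF-8")))
        doc.getDocumentElement.getTagName // illustrative: just the root tag name
      }
    }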
To: Jörn Franke <jornfra...@gmail.com> Subject: Re: Best way to read XML data from RDD
Hi Diwakar,
The Spark XML library can take an RDD as source ...
:44 AM, Diwakar Dhanuskodi <
diwakar.dhanusk...@gmail.com> wrote:
> Hi,
>
> java version 7
>
> mvn command
> ./make-distribution.sh --name custom-spark --tgz -Phadoop-2.6 -Phive
> -Phive-thriftserver -Pyarn -Dmaven.version=3.0.4
>
>
> yes, I executed script to c
spark-sql_2.11
On Thu, Sep 1, 2016 at 8:23 AM, Divya Gehlot <divya.htco...@gmail.com>
wrote:
> Which java version are you using ?
>
> On 31 August 2016 at 04:30, Diwakar Dhanuskodi <
> diwakar.dhanusk...@gmail.com> wrote:
>
>> Hi,
>>
>> While bui
>
> On Sat 3 Sep, 2016, 12:14 PM Diwakar Dhanuskodi, <
> diwakar.dhanusk...@gmail.com> wrote:
>
>> Hi,
>>
>> Just re-ran again without killing zinc server process
>>
>> ./make-distribution.sh --name custom-spark --tgz -Phadoop-2.6 -Phive
>>
Hi,
I recently built Spark using Maven. Now when starting spark-shell, it
couldn't connect to Hive and I'm getting the error below.
I couldn't find the datanucleus jar in the built libraries, but the
datanucleus jar is available in the hive/lib folder.
java.lang.ClassNotFoundException: ...
Please run with -X and post the logs here. We can get the exact error from them.
On Sat, Sep 3, 2016 at 7:24 PM, Marco Mistroni wrote:
> hi all
>
> I am getting failures when building Spark 2.0 on Ubuntu 16.04.
> Here are details of what I have installed on the Ubuntu host:
> - java 8
Hi,
While building Spark 1.6.2, I'm getting the error below in spark-sql. Any
help is much appreciated.
[ERROR] missing or invalid dependency detected while loading class file
'WebUI.class'.
Could not access term eclipse in package org,
because it (or its dependencies) are missing. Check your build ...
Hi,
Can you cross-check by providing the same library path in --jars of
spark-submit and running it?
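For example, assuming what's missing are the datanucleus jars from the hive/lib folder mentioned earlier (paths and versions are illustrative):

    spark-shell --jars /usr/lib/hive/lib/datanucleus-core-3.2.10.jar,/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar,/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar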
Original message From: "颜发才(Yan Facai)"
Date:18/08/2016 15:17 (GMT+05:30)
To: "user.spark" Cc:
Subject:
Just wanted to clarify:
Is foreachPartition in Spark an output operation?
Which one is better to use, mapPartitions or foreachPartition?
Regards,
Diwakar
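For what it's worth: foreachPartition is an action, so used on a DStream's RDDs (via foreachRDD) it acts as an output operation; mapPartitions is a lazy transformation that returns a new RDD and only runs when some action is called. Which is "better" depends on whether you need a result back. A sketch (rdd is an assumed RDD[String]; the connection helper is hypothetical):

    // mapPartitions: a transformation; returns an RDD, evaluated lazily
    val lengths = rdd.mapPartitions(iter => iter.map(_.length))

    // foreachPartition: an action; runs for its side effects and returns Unit.
    // Typical for writing out, with one connection per partition.
    rdd.foreachPartition { iter =>
      val conn = createConnection() // hypothetical helper
      iter.foreach(record => conn.send(record))
      conn.close()
    }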
Hi,
Is it possible to set up Spark within a YARN cluster which may not have
Hadoop?
Thanks.
... need to use
> other components of the Hadoop cluster, namely MapReduce and HDFS.
>
> That being said, if you just need cluster scheduling and are not using
> MapReduce or HDFS, it is possible you will be fine with the Spark
> Standalone cluster.
>
> Regards,
> Juan Martín