Yes, as long as there are 3 cores available on your local machine.
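For reference, a minimal sketch of such a read (the broker address and topic name are placeholders; it assumes the spark-sql-kafka connector is on the classpath):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[3]")   // one core per Kafka partition, assuming a local run
  .appName("kafka-structured-streaming")
  .getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder broker
  .option("subscribe", "my-topic")                       // placeholder 3-partition topic
  .load()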
On Fri, Apr 20, 2018 at 10:56 AM karthikjay wrote:
> I have the following code to read data from Kafka topic using the
> structured
> streaming. The topic has 3 partitions:
>
> val spark = SparkSession
>
Just wondering if anyone has tried the Spark Structured Streaming Kafka
connector (2.2) with Kafka 0.11 or Kafka 1.0.
Thanks
Raghav
I am using Structured Streaming to evaluate multiple rules on the same running
stream.
I have two options to do that. One is to use foreach and evaluate all the
rules on each row.
The other option is to express the rules in the Spark SQL DSL and run multiple
queries.
I was wondering if option 1 will result in
Did you try SparkContext.addSparkListener?
On Aug 3, 2017 1:54 AM, "Andrii Biletskyi"
wrote:
> Hi all,
>
> What is the correct way to schedule multiple jobs inside foreachRDD method
> and importantly await on result to ensure those jobs have completed
>
When you start a slave, you pass the address of the master as a parameter. That
slave will contact the master and register itself.
On Jan 25, 2017 4:12 AM, "kant kodali" wrote:
> Hi,
>
> How do I dynamically add nodes to spark standalone cluster and be able to
> discover them? Does
ippet ?
>>
>>
>>
>> On Tue, Jan 10, 2017 8:15 PM, Georg Heiler georg.kf.hei...@gmail.com
>> wrote:
>>
>> Maybe you can create an UDF?
>>
>> Raghavendra Pandey <raghavendra.pan...@gmail.com> schrieb am Di., 10.
>> Jan. 2017 um 20:04 Uh
I have around 41 levels of nested if-else in Spark SQL. I have programmed
it using the DataFrame APIs, but it takes too much time.
Is there anything I can do to improve the time here?
How large is your first text file? The idea is that you read the first text file
and, if it is not large, you can collect all its lines on the driver and then
read a text file for each line and union all the RDDs.
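A minimal sketch of that approach (the file name is a placeholder):

// read the small index file and collect its lines (paths) on the driver
val paths = sc.textFile("paths.txt").collect()

// read one RDD per listed file and union them all into a single RDD
val combined = sc.union(paths.map(p => sc.textFile(p)))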
On 13 Sep 2016 11:39 p.m., "Saliya Ekanayake" wrote:
> Just wonder if this
The default implementation is to append milliseconds. For Mesos it is the
framework-id. If you are using Mesos, you can assume that the framework-id
used to register your app is the same as the app-id.
As you said, you have a system application to schedule Spark jobs, so you can
keep track of the framework-ids submitted
Did you try passing them in spark-env.sh?
On Sat, Sep 3, 2016 at 2:42 AM, Eric Ho wrote:
> I'm trying to pass a trustStore pathname into pyspark.
> What env variable and/or config file or script I need to change to do this
> ?
> I've tried setting JAVA_OPTS env var but to
If your file format is splittable, say TSV, CSV, etc., it will be distributed
across all executors.
On Sat, Sep 3, 2016 at 3:38 PM, Somasundaram Sekar <
somasundar.se...@tigeranalytics.com> wrote:
> Hi All,
>
>
>
> Would like to gain some understanding on the questions listed below,
>
>
>
> 1.
Kapil -- I'm afraid you need to plug in your own SessionCatalog, as the
ResolveRelations class depends on that. To keep the design consistent
you may also like to implement ExternalCatalog.
You can also look at plugging in your own Analyzer class to give you more
flexibility. Ultimately that is where
Can you please send me as well.
Thanks
Raghav
On 12 May 2016 20:02, "Tom Ellis" wrote:
> I would like to also Mich, please send it through, thanks!
>
> On Thu, 12 May 2016 at 15:14 Alonso Isidoro wrote:
>
>> Me too, send me the guide.
>>
>> Enviado desde
Even though it does not sound intuitive, reduceByKey expects all the values
for a particular key within a partition to be loaded into memory. So once you
increase the number of partitions you should be able to run the job.
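For example, a sketch of supplying a higher partition count directly (it assumes pairs is an RDD[(K, V)]; 200 is just an illustrative number):

// supply more partitions so each partition holds fewer keys and values
val counts = pairs.reduceByKey(_ + _, 200)

// or repartition first and then reduce
val counts2 = pairs.repartition(200).reduceByKey(_ + _)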
org/docs/latest/running-on-mesos.html> *
>>
>> On Wed, May 11, 2016 at 10:05 PM, Giri P <gpatc...@gmail.com> wrote:
>>
>>> I'm not using docker
>>>
>>> On Wed, May 11, 2016 at 8:47 AM, Raghavendra Pandey <
>>> raghavendra.pan...@
By any chance, are you using docker to execute?
On 11 May 2016 21:16, "Raghavendra Pandey" <raghavendra.pan...@gmail.com>
wrote:
> On 11 May 2016 02:13, "gpatcham" <gpatc...@gmail.com> wrote:
>
> >
>
> > Hi All,
> >
> > I'm u
On 11 May 2016 02:13, "gpatcham" wrote:
>
> Hi All,
>
> I'm using --jars option in spark-submit to send 3rd party jars . But I
don't
> see they are actually passed to mesos slaves. Getting Noclass found
> exceptions.
>
> This is how I'm using --jars option
>
> --jars
You can create a column with the count of '/' characters, then take the max of it
and create that many columns for every row, with null fillers.
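A rough sketch of that idea with DataFrame functions (the DataFrame and column names are assumptions, not from your code):

import org.apache.spark.sql.functions._

// number of '/'-separated parts in col1
val withCount = df.withColumn("parts", size(split(col("col1"), "/")))

// maximum number of parts across all rows, collected on the driver
val maxParts = withCount.agg(max("parts")).first().getInt(0)

// one column per part; rows with fewer parts get nulls
val exploded = (0 until maxParts).foldLeft(withCount) { (d, i) =>
  d.withColumn(s"col1_$i", split(col("col1"), "/").getItem(i))
}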
Raghav
On 11 May 2016 20:37, "Bharathi Raja" wrote:
Hi,
I have a dataframe column col1 with values something like
You can use the date type...
On Dec 29, 2015 9:02 AM, "Divya Gehlot" wrote:
> Hi,
> I am newbee to Spark ,
> My appologies for such a naive question
> I am using Spark 1.4.1 and wrtiting code in scala . I have input data as
> CSVfile which I am parsing using spark-csv package
Can you increase the number of partitions and try... Also, I don't think you
need to cache the DataFrames before saving them... You can do away with that as well...
Raghav
On Oct 23, 2015 7:45 AM, "Ram VISWANADHA"
wrote:
> Hi ,
> I am trying to load 931MB file into an RDD, then
You may like to look at spark job server.
https://github.com/spark-jobserver/spark-jobserver
Raghavendra
You can use the coalesce function if you want to reduce the number of
partitions; it minimizes the data shuffle.
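For instance (the partition counts are just illustrations):

// coalesce avoids a full shuffle when shrinking the number of partitions
val fewer = rdd.coalesce(100)

// repartition (coalesce with shuffle = true) does a full shuffle and can also grow the count
val more = rdd.repartition(1000)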
-Raghav
On Sat, Oct 17, 2015 at 1:02 PM, shahid qadri
wrote:
> Hi folks
>
> I need to reparation large set of data around(300G) as i see some portions
>
You can add the classpath info in the Hadoop env file...
Add the following line to your $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export
HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*
Add the following line to $SPARK_HOME/conf/spark-env.sh
export
Here is a quick code sample I can come up with :
case class Input(ID:String, Name:String, PhoneNumber:String, Address:
String)
val df = sc.parallelize(Seq(Input("1", "raghav", "0123456789",
"houseNo:StreetNo:City:State:Zip"))).toDF()
val formatAddress = udf { (s: String) =>
You can use the Spark 1.5.1 "without Hadoop" build with Hadoop 2.7.1.
Hadoop 2.7.1 is more mature for s3a access. You also need to add the Hadoop
tools dir to the Hadoop classpath...
Raghav
On Oct 16, 2015 1:09 AM, "Scott Reynolds" wrote:
> We do not use EMR. This is deployed on Amazon VMs
Looks like you are facing an IPv6 issue. Can you try turning the
java.net.preferIPv4Stack property on?
On Oct 15, 2015 2:10 AM, "Steve Loughran" wrote:
>
> On 14 Oct 2015, at 20:56, Marco Mistroni wrote:
>
>
> 15/10/14 20:52:35 WARN : Your hostname, MarcoLaptop resolves to
I fixed these timeout errors by retrying...
On Oct 15, 2015 3:41 AM, "Kartik Mathur" wrote:
> Hi,
>
> I have some nightly jobs which runs every night but dies sometimes because
> of unresponsive master , spark master logs says -
>
> Not seeing much else there , what could
There is a "Spark without Hadoop" version... You can use that to link with any
custom Hadoop version.
Raghav
On Oct 10, 2015 5:34 PM, "Steve Loughran" wrote:
>
> During development, I'd recommend giving Hadoop a version ending with
> -SNAPSHOT, and building spark with maven,
, Atul Kulkarni <atulskulka...@gmail.com>
wrote:
> I am submitting the job with yarn-cluster mode.
>
> spark-submit --master yarn-cluster ...
>
> On Thu, Sep 10, 2015 at 7:50 PM, Raghavendra Pandey <
> raghavendra.pan...@gmail.com> wrote:
>
>> What is the
In MR jobs, the output is sorted only within a reducer... That can be better
emulated by sorting each partition of the RDD rather than total-sorting the
RDD...
In RDD.mapPartitions you can sort the data of one partition and try...
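A minimal sketch, assuming a pair RDD whose key type has an ordering:

val sortedWithinPartitions = pairRdd.mapPartitions { iter =>
  // sort only the records of this partition, like a single reducer's output in MR
  iter.toArray.sortBy(_._1).iterator
}

If you also need the records partitioned by key, repartitionAndSortWithinPartitions on a pair RDD does both in one shuffle.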
On Sep 11, 2015 7:36 AM, "周千昊" wrote:
> Hi, all
>
What is the value of the spark master conf? By default it is local, which means
only one thread can run, and that is why your job is stuck.
Specify local[*] to make the thread pool equal to the number of cores...
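For example, when configuring it in code rather than via spark-submit --master (the app name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// make the local thread pool as big as the number of cores
val conf = new SparkConf().setAppName("my-app").setMaster("local[*]")
val sc = new SparkContext(conf)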
Raghav
On Sep 11, 2015 6:06 AM, "Atul Kulkarni" wrote:
> Hi Folks,
Did you specify the partitioning column while saving the data?
On Sep 3, 2015 5:41 AM, "Kohki Nishio" wrote:
> Hello experts,
>
> I have a huge json file (> 40G) and trying to use Parquet as a file
> format. Each entry has a unique identifier but other than that, it doesn't
> have
You can increase the number of partitions and try...
On Sep 3, 2015 5:33 AM, "Silvio Fiorito"
wrote:
> Unfortunately, groupBy is not the most efficient operation. What is it
> you’re trying to do? It may be possible with one of the other *byKey
> transformations.
>
>
flatMap is just like map, but it flattens out the Seq output of the
closure...
In your case, you call the "to" function, which returns a range...
a.to(b) returns (a, ..., b).
So rdd.flatMap(x => x.to(3)) will take every element and return the range up to
3...
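A quick illustration of the difference:

val rdd = sc.parallelize(Seq(1, 2, 3))
rdd.map(x => x.to(3)).collect()      // one range per element: (1..3), (2..3), (3..3)
rdd.flatMap(x => x.to(3)).collect()  // flattened: Array(1, 2, 3, 2, 3, 3)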
On Sep 3, 2015 7:36 AM, "Ashish Soni"
Looks like your Jackson package and Spark's Jackson package are at different versions.
Raghav
On Aug 28, 2015 4:01 PM, Manohar753 manohar.re...@happiestminds.com
wrote:
Hi Team,
I upgraded spark older versions to 1.4.1 after maven build i tried to ran
my
simple application but it failed and giving the
So either you have an empty line at the end, or when you use String.split you don't
specify -1 as the second parameter...
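A quick illustration of the trailing-empty-field behaviour:

"a,b,,".split(",")      // Array(a, b)         -- trailing empty strings are dropped
"a,b,,".split(",", -1)  // Array(a, b, "", "") -- limit -1 keeps them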
On Aug 29, 2015 1:18 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
I suspect in the last scenario you are having an empty new line at the
last line. If you put a try..catch you'd
We are using Cassandra for a similar kind of problem and it works well... You
need to take care of the race condition between updating the store and looking
it up...
On Aug 29, 2015 1:31 AM, Ted Yu yuzhih...@gmail.com wrote:
+1 on Jason's suggestion.
bq. this large variable is broadcast many
Did you try increasing the number of SQL shuffle partitions (spark.sql.shuffle.partitions)?
On Tue, Aug 25, 2015 at 11:06 AM, kundan kumar iitr.kun...@gmail.com
wrote:
I am running this query on a data size of 4 billion rows and
getting org.apache.spark.shuffle.FetchFailedException error.
select adid,position,userid,price
from (
select
...@gmail.com wrote:
spark-env.sh works for me in Spark 1.4 but not
spark.executor.extraJavaOptions.
On Sun, Aug 23, 2015 at 11:27 AM Raghavendra Pandey
raghavendra.pan...@gmail.com wrote:
I think the only way to pass on environment variables to worker node is
to write it in spark-env.sh file
I think the only way to pass on environment variables to worker node is to
write it in spark-env.sh file on each worker node.
On Sun, Aug 23, 2015 at 8:16 PM, Hemant Bhanawat hemant9...@gmail.com
wrote:
Check for spark.driver.extraJavaOptions and
spark.executor.extraJavaOptions in the
You can get the list of all the persisted RDDs using the Spark context
(SparkContext.getPersistentRDDs)...
On Aug 21, 2015 12:06 AM, Rishitesh Mishra rishi80.mis...@gmail.com
wrote:
I am not sure if you can view all RDDs in a session. Tables are maintained
in a catalogue . Hence its easier. However you can see the DAG
representation
Did you try with Hadoop version 2.7.1? It is known that s3a works really
well with Parquet in 2.7. They fixed a lot of issues
related to metadata reading there...
On Aug 21, 2015 11:24 PM, Jerrick Hoang jerrickho...@gmail.com wrote:
@Cheng, Hao : Physical plans show that it
Impact,
You can group by the data and then sort it by timestamp and take max to
select the oldest value.
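A rough sketch of that, assuming a key column and a timestamp column (the names are placeholders); to get the full matching row back you can join the aggregate to the original DataFrame:

import org.apache.spark.sql.functions._

// one timestamp per key
val picked = df.groupBy("key").agg(max("timestamp").as("timestamp"))

// recover the complete rows for the selected timestamps
val selectedRows = df.join(picked,
  df("key") === picked("key") && df("timestamp") === picked("timestamp"))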
On Aug 21, 2015 11:15 PM, Impact nat...@skone.org wrote:
I am also looking for a way to achieve the reducebykey functionality on
data
frames. In my case I need to select one particular row
I think you can try the DataFrame create API (createDataFrame) that takes an RDD[Row]
and a StructType...
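A minimal sketch of that approach; the column names and types here are placeholders, and the rows must line up with the schema you declare:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("A", StringType, nullable = false),
  StructField("C", StringType, nullable = true)   // all-null column gets an explicit type
))

val rowRDD = jsonDF.rdd   // or any RDD[Row] matching the schema
val fixedDF = sqlContext.createDataFrame(rowRDD, schema)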
On Aug 11, 2015 4:28 PM, Jyun-Fan Tsai jft...@appier.com wrote:
Hi all,
I'm using Spark 1.4.1. I create a DataFrame from json file. There is
a column C that all values are null in the json file. I found that
In Spark 1.4 there is a parameter to control that, spark.sql.autoBroadcastJoinThreshold;
its default value is 10 MB. So you need to cache your DataFrame to hint the size.
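For illustration, a rough sketch of tuning that (the 50 MB value and the smallDF name are placeholders):

// raise the auto broadcast threshold to ~50 MB
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

// caching the smaller DataFrame gives the optimizer a size estimate
smallDF.cache()
smallDF.count()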
On Aug 14, 2015 7:09 PM, VIJAYAKUMAR JAWAHARLAL sparkh...@data2o.io
wrote:
Hi
I am facing huge performance problem when I am trying to left outer join
very big
You can use the struct function of the org.apache.spark.sql.functions object to
combine two columns into a struct column.
Something like:
val nestedCol = struct(df("d"), df("e"))
df.select(df("a"), df("b"), df("c"), nestedCol)
On Aug 7, 2015 3:14 PM, Rishabh Bhardwaj rbnex...@gmail.com wrote:
I am doing it by creating
I have complex transformation requirements that I am implementing using
DataFrames. It involves a lot of joins, also with a Cassandra table.
I was wondering how I can debug the jobs and stages queued by Spark SQL the
way I can do for RDDs.
In one of the cases, Spark SQL creates more than 17 lakh (1.7 million) tasks for
Hello,
I would like to add a column of StructType to DataFrame.
What would be the best way to do it? Not sure if it is possible using
withColumn. A possible way is to convert the dataframe into a RDD[Row], add
the struct and then convert it back to dataframe. But that seems an
overkill.
Please
If you cache the RDD it will save some operations. But anyway, filter is a lazy
operation, and it runs based on what you will do later on with rdd1 and
rdd2...
Raghavendra
On Jul 16, 2015 1:33 PM, Bin Wang wbi...@gmail.com wrote:
If I write code like this:
val rdd = input.map(_.value)
val f1 =
if I would use both rdd1 and rdd2 later?
Raghavendra Pandey raghavendra.pan...@gmail.com于2015年7月16日周四 下午4:08写道:
If you cache rdd it will save some operations. But anyway filter is a
lazy operation. And it runs based on what you will do later on with rdd1
and rdd2...
Raghavendra
On Jul 16, 2015
Why don't you apply the filter first and then group the data and run the
aggregations?
On Jul 3, 2015 1:29 PM, Megha Sridhar- Cynepia megha.sridh...@cynepia.com
wrote:
Hi,
I have a Spark DataFrame object, which when trimmed, looks like,
FromTo Subject
You cannot update a broadcast variable... it won't get reflected on the
workers.
On Jul 3, 2015 12:18 PM, James Cole ja...@binarism.net wrote:
Hi all,
I'm filtering a DStream using a function. I need to be able to change this
function while the application is running (I'm polling a service to
It is the basic design of Spark that it runs all actions in different
stages...
Not sure you can achieve what you are looking for.
On Jul 3, 2015 12:43 PM, Marius Danciu marius.dan...@gmail.com wrote:
Hi all,
If I have something like:
rdd.join(...).mapPartitionToPair(...)
It looks like
I don't think you can expect any ordering guarantee except for the records within one
partition.
On Jul 4, 2015 7:43 AM, khaledh khal...@gmail.com wrote:
I'm writing a Spark Streaming application that uses RabbitMQ to consume
events. One feature of RabbitMQ that I intend to make use of is bulk ack of
This will not work, i.e. using a DataFrame inside a map function...
Although you can try to create the df separately and cache it...
Then you can join your event stream with this df.
On Jul 2, 2015 6:11 PM, Ashish Soni asoni.le...@gmail.com wrote:
Hi All ,
I have and Stream of Event coming in and i want
will be the best way to
do it,
Ashish
On Wed, Jul 1, 2015 at 10:45 PM, Raghavendra Pandey
raghavendra.pan...@gmail.com wrote:
You cannot refer to one RDD inside another RDD's map function...
The RDD object is not serializable. Whatever objects you use inside the map
function should be serializable
So do you want to change the behavior of the persist API, or write the RDD to
disk...
On Jul 1, 2015 9:13 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I think i want to use persist then and write my intermediate RDDs to
disk+mem.
On Wed, Jul 1, 2015 at 8:28 AM, Raghavendra Pandey
raghavendra.pan
By any chance, are you using a time field in your df? Time fields are known
to be notorious in RDD conversion.
On Jul 1, 2015 6:13 PM, Pooja Jain pooja.ja...@gmail.com wrote:
Join is happening successfully as I am able to do count() after the join.
Error is coming only while trying to write in
:
i original assumed that persisting is similar to writing. But its not.
Hence i want to change the behavior of intermediate persists.
On Wed, Jul 1, 2015 at 8:46 AM, Raghavendra Pandey
raghavendra.pan...@gmail.com wrote:
So do you want to change the behavior of persist api or write the rdd
I am using the Spark driver as a REST service. I used spray.io to make my app a
REST server.
I think this is a good design for applications that you want to keep in
long-running mode...
On Jul 1, 2015 6:28 PM, Arush Kharbanda ar...@sigmoidanalytics.com
wrote:
You can try using Spark Jobserver
I am not sure if you can broadcast a DataFrame without collecting it on the
driver...
On Jul 1, 2015 11:45 PM, Ashish Soni asoni.le...@gmail.com wrote:
Hi ,
I need to load 10 tables in memory and have them available to all the
workers , Please let me me know what is the best way to do broadcast
You cannot refer to one RDD inside another RDD's map function...
The RDD object is not serializable. Whatever objects you use inside the map
function should be serializable, as they get transferred to the executor nodes.
On Jul 2, 2015 6:13 AM, Ashish Soni asoni.le...@gmail.com wrote:
Hi All ,
I am not sure
I am facing the same issue. FAIR and FIFO are behaving in the same way.
On Wed, Apr 1, 2015 at 1:49 AM, asadrao as...@microsoft.com wrote:
Hi, I am using the Spark ‘fair’ scheduler mode. I have noticed that if the
first query is a very expensive query (ex: ‘select *’ on a really big data
set)
Can you try increasing the ulimit -n on your machine?
On Mon, Feb 23, 2015 at 10:55 PM, Marius Soutier mps@gmail.com wrote:
Hi Sameer,
I’m still using Spark 1.1.1, I think the default is hash shuffle. No
external shuffle service.
We are processing gzipped JSON files, the partitions are
Even I observed the same issue.
On Fri, Feb 6, 2015 at 12:19 AM, Praveen Garg praveen.g...@guavus.com
wrote:
Hi,
While moving from spark 1.1 to spark 1.2, we are facing an issue where
Shuffle read/write has been increased significantly. We also tried running
the job by rolling back to
You can also do something like
rdd.sparkContext.runJob(rdd, (iter: Iterator[T]) => {
  while (iter.hasNext) iter.next()
})
On Sat, Jan 31, 2015 at 5:24 AM, Sean Owen so...@cloudera.com wrote:
Yeah, from an unscientific test, it looks like the time to cache the
blocks still dominates. Saving
You can use the Hadoop client API to remove files:
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#delete(org.apache.hadoop.fs.Path, boolean).
I don't think Spark has any wrapper over the Hadoop FileSystem APIs.
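A minimal sketch of calling that API from a Spark job (the path is a placeholder):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path("/tmp/old-output"), true)  // recursive = true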
On Thu, Jan 22, 2015 at 12:15 PM, LinQili lin_q...@outlook.com
If you pass the Spark master URL to spark-submit, you don't need to pass the
same to the SparkConf object. You can create the SparkConf without this property, or
for that matter without any other property that you pass to spark-submit.
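For example (the app name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// no setMaster here -- the master comes from spark-submit, e.g. --master spark://host:7077
val conf = new SparkConf().setAppName("my-app")
val sc = new SparkContext(conf)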
On Sun, Jan 18, 2015 at 7:38 AM, guxiaobo1982 guxiaobo1...@qq.com wrote:
Hi,
If you are running Spark in local mode, the executor parameters are not used, as
there is no executor. You should try to set the corresponding driver parameter
to get the effect.
On Mon, Jan 19, 2015, 00:21 Sean Owen so...@cloudera.com wrote:
OK. Are you sure the executor has the memory you think? -Xmx24g in
I believe the default hash partitioner logic in Spark will send all records with the
same key to the same machine.
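A sketch of making that partitioning explicit (it assumes pairRDD is an RDD[(K, V)]; the partition count is an example):

import org.apache.spark.HashPartitioner

// all records with the same key land in the same partition, hence the same machine
val partitioned = pairRDD.partitionBy(new HashPartitioner(100))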
On Wed, Jan 14, 2015, 03:03 Puneet Kapoor puneet.cse.i...@gmail.com wrote:
Hi,
I have a usecase where in I have hourly spark job which creates hourly
RDDs, which are partitioned by keys.
At
You can take a look at http://zeppelin.incubator.apache.org. It is a
notebook and graphical visualization designer.
On Sun, Jan 11, 2015, 01:45 Cui Lin cui@hds.com wrote:
Thanks, Gaurav and Corey,
Probably I didn’t make myself clear. I am looking for best Spark
practice similar to Shiny for R,
the saveAsParquetFile method to allows me to
persist the avro schema inside parquet. Is this possible at all?
Best Regards,
Jerry
On Fri, Jan 9, 2015 at 2:05 AM, Raghavendra Pandey
raghavendra.pan...@gmail.com wrote:
I cam across this
http://zenfractal.com/2013/08/21/a-powerful-big
You may like to look at the spark.cleaner.ttl configuration, which is infinite
by default. Spark has that configuration to delete temp files from time to time.
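For example (the one-hour value is just an illustration):

import org.apache.spark.SparkConf

// clean metadata and old temp/shuffle data after one hour
val conf = new SparkConf().set("spark.cleaner.ttl", "3600")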
On Fri Jan 09 2015 at 8:34:10 PM michael.engl...@nomura.com wrote:
Hi,
Is there a way of automatically cleaning up the spark.local.dir after
I came across this http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/.
You can take a look.
On Fri Jan 09 2015 at 12:08:49 PM Raghavendra Pandey
raghavendra.pan...@gmail.com wrote:
I have the similar kind of requirement where I want to push avro data into
parquet. But it seems you have
I have a similar kind of requirement where I want to push Avro data into
Parquet. But it seems you have to do it on your own. There is the parquet-mr
project that uses Hadoop to do so. I am trying to write a Spark job to do a
similar kind of thing.
On Fri, Jan 9, 2015 at 3:20 AM, Jerry Lam
You can also use the join function of RDD. This is actually a kind of append
function that adds up all the RDDs and creates one uber RDD.
On Wed, Jan 7, 2015, 14:30 rkgurram rkgur...@gmail.com wrote:
Thank you for the response, sure will try that out.
Currently I changed my code such that the first
of Complex
objects and hence it slows down the performance. Here's a link
http://spark.apache.org/docs/latest/tuning.html to performance tuning
if you haven't seen it already.
Thanks
Best Regards
On Wed, Dec 31, 2014 at 8:45 AM, Raghavendra Pandey
raghavendra.pan...@gmail.com wrote:
I
Why don't you push "\n" instead of "\t" in your first transformation [
(fields(0), (fields(1)+"\t"+fields(3)+"\t"+fields(5)+"\t"+fields(7)+"\t"+fields(9)))
] and then do saveAsTextFile?
-Raghavendra
On Wed Dec 31 2014 at 1:42:55 PM Sanjay Subramanian
sanjaysubraman...@yahoo.com.invalid wrote:
hey guys
My
I have a Spark app that involves a series of mapPartitions operations and then
a keyBy operation. I have measured the time inside the mapPartitions function
blocks. These blocks take trivial time. Still, the application takes way too
much time, and even the Spark UI shows that much time.
So I was wondering where
It seems there is hadoop 1 somewhere in the path.
On Fri, Dec 19, 2014, 21:24 Sean Owen so...@cloudera.com wrote:
Yes, but your error indicates that your application is actually using
Hadoop 1.x of some kind. Check your dependencies, especially
hadoop-client.
On Fri, Dec 19, 2014 at 2:11
Yeah, the dot notation works. It works even for arrays. But I am not sure
if it can handle complex hierarchies.
On Mon Dec 08 2014 at 11:55:19 AM Cheng Lian lian.cs@gmail.com wrote:
You may access it via something like SELECT filterIp.element FROM tb,
just like Hive. Or if you’re using
You don't need to worry about locks as such, as one thread/worker is
exclusively responsible for one partition of the RDD. You can use the
Accumulator variables that Spark provides to get the state updates.
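A minimal sketch with the Spark 1.x accumulator API (the counter is just an example of a state update; rdd is assumed):

// driver-side accumulator; tasks only add to it, the driver reads the total
val processed = sc.accumulator(0L)

rdd.foreach { record =>
  processed += 1L
}

println(processed.value)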
On Mon Dec 08 2014 at 8:14:28 PM aditya.athalye adbrihadarany...@gmail.com
wrote:
I am