Re: convert pandas image column to spark dataframe

2023-08-03 Thread Adrian Pop-Tifrea
Hello, can you also please show us how you created the pandas DataFrame? I mean, how you added the actual data into the dataframe. It would help us reproduce the error. Thank you, Pop-Tifrea Adrian On Mon, Jul 31, 2023 at 5:03 AM second_co...@yahoo.com < second_co...@yahoo.com>

Re: convert pandas image column to spark dataframe

2023-07-27 Thread Adrian Pop-Tifrea
Hello, when you said your pandas DataFrame has 10 rows, does that mean it contains 10 images? Because if that's the case, then you'd want to only use 3 layers of ArrayType when you define the schema. Best regards, Adrian On Thu, Jul 27, 2023, 11:04 second_co...@yahoo.com.INVALID wrote: >

bucket joins on multiple data frames.

2021-09-08 Thread Adrian Stern
ey", "full") - joined.write-bucketed() ; joined = spark.table("joined") - joined = joined.join(t3, "key", "full") I'm wondering if there is a way to get performance gains here, either by using bucketing or some other way. Also courions if this isn't what bucket joins are for, what are they actually for. Thanks Adrian

Re: How does preprocessing fit into Spark MLlib pipeline

2017-05-16 Thread Adrian Stern
2 rows, while the others end up with 3. Am I missing a clean way to do this in a pipeline? Maybe there is a way to reduce the data, in the beginning, to make it so I can use transforms. Maybe by grouping all the events together in a list per user. Keep in mind in my actual use case I have

Re: coalesce ending up very unbalanced - but why?

2016-12-14 Thread Adrian Bridgett
Since it's pyspark it's just using the default hash partitioning I believe. Trying a prime number (71 so that there's enough CPUs) doesn't seem to change anything. Out of curiosity, why did you suggest that? Googling "spark coalesce prime" doesn't give me any clue :-) Adrian On

coalesce ending up very unbalanced - but why?

2016-12-14 Thread Adrian Bridgett
the size of the final partitions rather than the number of source partitions in each final partition. Thanks for any light you can shine! Adrian

Re: mesos in spark 2.0.1 - must call stop() otherwise app hangs

2016-10-05 Thread Adrian Bridgett
Fab thanks all - I'll ensure we fix our code :-) On 05/10/2016 18:10, Sean Owen wrote: Being discussed as we speak at https://issues.apache.org/jira/browse/SPARK-17707 Calling stop() is definitely the right thing to do and always has been (see examples), but, may be possible to get rid of

Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Adrian Bridgett
high). Hope this helps. Adrian

Re: very high maxresults setting (no collect())

2016-09-22 Thread Adrian Bridgett
the driver) and then we start to hit memory pressure on the driver in the output loop and the job grinds to a crawl (we eventually have to kill it and restart with more memory). Adrian

very high maxresults setting (no collect())

2016-09-19 Thread Adrian Bridgett
ite.format("com.databricks.spark.csv").save('/tmp/...') Cheers for any help/pointers! There are a couple of memory leak tickets fixed in v1.6.2 that may affect the driver so I may try an upgrade (the executors are fine). Adrian
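The driver-side cap on fetched results is normally controlled by spark.driver.maxResultSize; a minimal sketch, assuming that is the setting referred to above (the value is illustrative):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.driver.maxResultSize", "4g")   // upper bound on serialized results pulled back to the driver
    // equivalently: spark-submit --conf spark.driver.maxResultSize=4g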

2.0.1/2.1.x release dates

2016-08-18 Thread Adrian Bridgett
for any info, Adrian

coalesce serialising earlier work

2016-08-09 Thread Adrian Bridgett
ant the df to be calculated in parallel and this is _then_ coalesced before being written. (It may be that the -getmerge approach will still be faster) df.coalesce(100).coalesce(1).write..... doesn't look very likely to help! Adrian -- *Adrian Bridgett*

[REPOST] Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-06-03 Thread Adrian Tanase
Hi all, Trying to repost this question from a colleague on my team, somehow his subscription is not active: http://apache-spark-user-list.1001560.n3.nabble.com/Severe-Spark-Streaming-performance-degradation-after-upgrading-to-1-6-1-td27056.html Appreciate any thoughts, -adrian

odd python.PythonRunner Times values?

2016-05-23 Thread Adrian Bridgett
I'm seeing output like this on our mesos spark slaves: 16/05/23 11:44:04 INFO python.PythonRunner: Times: total = 1137, boot = -590, init = 593, finish = 1134 16/05/23 11:44:04 INFO python.PythonRunner: Times: total = 1652, boot = -446, init = 481, finish = 1617 This seems to be coming from

Re: Worker's BlockManager Folder not getting cleared

2016-01-26 Thread Adrian Bridgett
t how to get rid of this and help on understanding this behaviour. Thanks !!! Abhi -- *Adrian Bridgett* | Sysadmin Engineer, OpenSignal <http://www.opensignal.com>

Re: Executor deregistered after 2mins (mesos, 1.6.0-rc4)

2015-12-30 Thread Adrian Bridgett
I thought the Driver used). Anyhow I'll do more testing and then raise a JIRA. Adrian -- *Adrian Bridgett* | Sysadmin Engineer, OpenSignal <http://www.opensignal.com>

Re: Executor deregistered after 2mins (mesos, 1.6.0-rc4)

2015-12-30 Thread Adrian Bridgett
to be the core issue. On 29/12/2015 21:17, Ted Yu wrote: Have you searched log for 'f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4' ? In the snippet you posted, I don't see registration of this Executor. Cheers On Tue, Dec 29, 2015 at 12:43 PM, Adrian Bridgett <adr...@opensignal.com <mail

Re: Executor deregistered after 2mins (mesos, 1.6.0-rc4)

2015-12-30 Thread Adrian Bridgett
To wrap this up, it's the shuffle manager sending the FIN so setting spark.shuffle.io.connectionTimeout to 3600s is the only workaround right now. SPARK-12583 raised. Adrian -- *Adrian Bridgett*
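A minimal sketch of that workaround (the values are illustrative):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.shuffle.io.connectionTimeout", "3600s")   // keep shuffle connections open longer
      .set("spark.network.timeout", "3600s")                // general network timeout fallback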

Executor deregistered after 2mins (mesos, 1.6.0-rc4)

2015-12-29 Thread Adrian Bridgett
ry setting that on the shuffle service): spark.network.timeout 180s spark.shuffle.io.connectionTimeout 240s Adrian -- *Adrian Bridgett*

Re: default parallelism and mesos executors

2015-12-15 Thread Adrian Bridgett
Thanks Iulian, I'll retest with 1.6.x once it's released (probably won't have enough spare time to test with the RC). On 11/12/2015 15:00, Iulian Dragoș wrote: On Wed, Dec 9, 2015 at 4:29 PM, Adrian Bridgett <adr...@opensignal.com <mailto:adr...@opensignal.com>> wrote:

default parallelism and mesos executors

2015-12-09 Thread Adrian Bridgett
.ec2.internal:41194/user/Executor#-1021429650]) with ID 20151117-115458-164233482-5050-24333-S22/5 15/12/02 14:34:15 INFO spark.ExecutorAllocationManager: New executor 20151117-115458-164233482-5050-24333-S22/5 has registered (new total is 1) .... >&

default parallelism and mesos executors

2015-12-02 Thread Adrian Bridgett
115458-164233482-5050-24333-S22/5 15/12/02 14:34:15 INFO spark.ExecutorAllocationManager: New executor 20151117-115458-164233482-5050-24333-S22/5 has registered (new total is 1) >>> print (sc.defaultParallelism) 42 -- *Adrian Bridg

Re: thought experiment: use spark ML to real time prediction

2015-11-11 Thread Adrian Tanase
outside of spark – loading the precomputed model and calling .predict on it… -adrian From: Andy Davidson Date: Tuesday, November 10, 2015 at 11:31 PM To: "user @spark" Subject: thought experiment: use spark ML to real time prediction Lets say I have use spark ML to train a linear model

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-10 Thread Adrian Tanase
without leaders or big imbalances. Hope this helps, -adrian On 11/9/15, 8:26 PM, "swetha" <swethakasire...@gmail.com> wrote: >Hi, > >How to recover Kafka Direct automatically when there is a problem with >Kafka brokers? Sometimes our Kafka Brokers gets messed up a

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-10 Thread Adrian Tanase
folder to help the job recover? E.g. Go 2 steps back, hoping that kafka has those offsets. -adrian From: swetha kasireddy Date: Monday, November 9, 2015 at 10:40 PM To: Cody Koeninger Cc: "user@spark.apache.org<mailto:user@spark.apache.org>" Subject: Re: Kafka Direct does not recov

Re: Dynamic Allocation & Spark Streaming

2015-11-06 Thread Adrian Tanase
You can register a streaming listener – in the BatchInfo you’ll find a lot of stats (including count of received records) that you can base your logic on: https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/StreamingListener.scala
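A minimal sketch of such a listener, assuming a StreamingContext named ssc (the scale-down logic is illustrative):

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    class RecordCountListener extends StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val received = batch.batchInfo.numRecords                  // records ingested in this batch
        if (received == 0) println("idle batch - could scale down")
      }
    }
    ssc.addStreamingListener(new RecordCountListener)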

Re: How to unpersist a DStream in Spark Streaming

2015-11-06 Thread Adrian Tanase
? -adrian Sent from my iPhone On 06 Nov 2015, at 09:45, Tathagata Das <t...@databricks.com<mailto:t...@databricks.com>> wrote: Spark streaming automatically takes care of unpersisting any RDDs generated by DStream. You can set the StreamingContext.remember() to set the minimum persiste

Re: Scheduling Spark process

2015-11-05 Thread Adrian Tanase
% accuracy, when you can use HLL aggregators). Hope this helps, -adrian On 11/5/15, 10:48 AM, "danilo" <dani.ri...@gmail.com> wrote: >Hi All, > >I'm quite new about this topic and about Spark in general. > >I have a sensor that is pushing data in real time and

Re: How to use data from Database and reload every hour

2015-11-05 Thread Adrian Tanase
of broadcast them to the executors. See this thread: http://search-hadoop.com/m/q3RTt2UD6KyBO5M1=Re+Streaming+checkpoints+and+logic+change Hope this helps, -adrian From: Kay-Uwe Moosheimer Date: Thursday, November 5, 2015 at 3:33 PM To: "user@spark.apache.org<mailto:user@spark.apache.org&g

Re: Why some executors are lazy?

2015-11-04 Thread Adrian Tanase
* The scheduler might decide to take advantage of free cores in the cluster and schedule off-node processing, and you can control how long it waits through the spark.locality.wait settings Hope this helps, -adrian From: Khaled Ammar Date: Wednesday, November 4, 2015 at 4:03 PM To: Adrian Tanase

Re: Spark Streaming data checkpoint performance

2015-11-04 Thread Adrian Tanase
Nice! Thanks for sharing, I wasn’t aware of the new API. Left some comments on the JIRA and design doc. -adrian From: Shixiong Zhu Date: Tuesday, November 3, 2015 at 3:32 AM To: Thúy Hằng Lê Cc: Adrian Tanase, "user@spark.apache.org<mailto:user@spark.apache.org>" Subject: Re:

Re: Why some executors are lazy?

2015-11-04 Thread Adrian Tanase
If some of the operations required involve shuffling and partitioning, it might mean that the data set is skewed to specific partitions which will create hot spotting on certain executors. -adrian From: Khaled Ammar Date: Tuesday, November 3, 2015 at 11:43 PM To: "user@spark.apach

FW: Spark streaming - failed recovery from checkpoint

2015-11-02 Thread Adrian Tanase
Re-posting here, didn’t get any feedback on the dev list. Has anyone experienced corrupted checkpoints recently? Thanks! -adrian From: Adrian Tanase Date: Thursday, October 29, 2015 at 1:38 PM To: "d...@spark.apache.org<mailto:d...@spark.apache.org>" Subject: Spark streaming -

Re: execute native system commands in Spark

2015-11-02 Thread Adrian Tanase
Have you seen .pipe()? On 11/2/15, 5:36 PM, "patcharee" wrote: >Hi, > >Is it possible to execute native system commands (in parallel) Spark, >like scala.sys.process ? > >Best, >Patcharee > >- >To
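A minimal sketch of .pipe(), which runs a native command per partition and streams the records through it (path and command are illustrative):

    val errors: org.apache.spark.rdd.RDD[String] =
      sc.textFile("hdfs:///logs/app").pipe("grep ERROR")   // one external process per partition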

Re: Spark Streaming data checkpoint performance

2015-11-02 Thread Adrian Tanase
best bet here is reduceByKeyAndWindow with an inverse function * Make your state object more complicated and try to prune out words with very few occurrences or that haven’t been updated for a long time * You can do this by emitting None from updateStateByKey Hope this helps, -adrian
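A minimal sketch of the pruning idea, assuming a DStream[(String, Long)] of word counts named wordCounts (the WordStats class and thresholds are illustrative):

    case class WordStats(count: Long, lastUpdated: Long)

    val pruned = wordCounts.updateStateByKey[WordStats] { (updates: Seq[Long], old: Option[WordStats]) =>
      val total = old.map(_.count).getOrElse(0L) + updates.sum
      val last  = if (updates.nonEmpty) System.currentTimeMillis() else old.map(_.lastUpdated).getOrElse(0L)
      val stale = System.currentTimeMillis() - last > 60 * 60 * 1000L
      if (stale && total < 5) None                      // returning None drops the key from the state
      else Some(WordStats(total, last))
    }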

Re: Pivot Data in Spark and Scala

2015-10-30 Thread Adrian Tanase
-with-spark-ml-pipelines.html -adrian From: Deng Ching-Mallete Date: Friday, October 30, 2015 at 4:35 AM To: Ascot Moss Cc: User Subject: Re: Pivot Data in Spark and Scala Hi, You could transform it into a pair RDD then use the combineByKey function. HTH, Deng On Thu, Oct 29, 2015 at 7:29 PM, Ascot

Re: Mock Cassandra DB Connection in Unit Testing

2015-10-29 Thread Adrian Tanase
Does it need to be a mock? Can you use sc.parallelize(data)? From: Priya Ch Date: Thursday, October 29, 2015 at 2:00 PM To: Василец Дмитрий Cc: "user@spark.apache.org", "spark-connector-u...@lists.datastax.com"
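A minimal sketch of that suggestion, replacing the Cassandra read with an in-memory RDD in the unit test (the rows are illustrative):

    val fakeCassandraRows = Seq(("user1", 10), ("user2", 20))
    val testRdd = sc.parallelize(fakeCassandraRows)   // stands in for the table scan under test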

Re: Spark/Kafka Streaming Job Gets Stuck

2015-10-29 Thread Adrian Tanase
. For example you could be writing LESS files MORE OFTEN and achieve a similar effect. All of this is of course hypothetical since I don’t know what processing you are applying to the data coming from Kafka. More like food for thought. -adrian On 10/29/15, 2:50 PM, "Afshartous, Nick"

Re: Need more tasks in KafkaDirectStream

2015-10-29 Thread Adrian Tanase
) and it works great. -adrian From: varun sharma Date: Thursday, October 29, 2015 at 8:27 AM To: user Subject: Need more tasks in KafkaDirectStream Right now, there is one to one correspondence between kafka partitions and spark partitions. I dont have a requirement of one to one semantics. I need

Re: How do I parallize Spark Jobs at Executor Level.

2015-10-28 Thread Adrian Tanase
of the entries, on the worker. -adrian From: Vinoth Sankar Date: Wednesday, October 28, 2015 at 2:49 PM To: "user@spark.apache.org<mailto:user@spark.apache.org>" Subject: How do I parallize Spark Jobs at Executor Level. Hi, I'm reading and filtering large no of files using Spa

Re: Spark/Kafka Streaming Job Gets Stuck

2015-10-28 Thread Adrian Tanase
Does it work as expected with smaller batch or smaller load? Could it be that it's accumulating too many events over 3 minutes? You could also try increasing the parallelism via repartition to ensure smaller tasks that can safely fit in working memory. Sent from my iPhone > On 28 Oct 2015, at

Re: There is any way to write from spark to HBase CDH4?

2015-10-27 Thread Adrian Tanase
Also I just remembered about cloudera’s contribution http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/ From: Deng Ching-Mallete Date: Tuesday, October 27, 2015 at 12:03 PM To: avivb Cc: user Subject: Re: There is any way to write from spark to HBase

Re: There is any way to write from spark to HBase CDH4?

2015-10-27 Thread Adrian Tanase
This is probably too low level but you could consider the async client inside foreachRdd: https://github.com/OpenTSDB/asynchbase http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd On 10/27/15, 11:36 AM, "avivb" wrote:

Re: There is any way to write from spark to HBase CDH4?

2015-10-27 Thread Adrian Tanase
You can get a feel for it by playing with the original library published as separate project on github https://github.com/cloudera-labs/SparkOnHBase From: Deng Ching-Mallete Date: Tuesday, October 27, 2015 at 12:39 PM To: Fengdong Yu Cc: Adrian Tanase, avivb, user Subject: Re: There is any way

Re: SPARKONHBase checkpointing issue

2015-10-27 Thread Adrian Tanase
Does this help? https://issues.apache.org/jira/browse/SPARK-5206 On 10/27/15, 1:53 PM, "Amit Singh Hora" wrote: >Hi all , > >I am using Cloudera's SparkOnHBase to bulk insert in hbase. Please find >below code >object test { > >def main(args: Array[String]): Unit =

Re: Separate all values from Iterable

2015-10-27 Thread Adrian Tanase
as documentation): def toBson(bean: ProductBean): BSONObject = { … } val customerBeans: RDD[(Long, Seq[ProductBean])] = allBeans.groupBy(_.customerId) val mongoObjects: RDD[BSONObject] = customerBeans.flatMap { case (id, beans) => beans.map(toBson) } Hope this helps, -adrian From: Shams ul Ha

Re: [Spark Streaming] How do we reset the updateStateByKey values.

2015-10-26 Thread Adrian Tanase
of updates is a resetState message. If not, continue summing the others. I can provide scala samples, my java is beyond rusty :) -adrian From: Uthayan Suthakar Date: Friday, October 23, 2015 at 2:10 PM To: Sander van Dijk Cc: user Subject: Re: [Spark Streaming] How do we reset
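A minimal sketch of that reset-message idea, assuming a DStream[(String, Long)] named counts where a negative update acts as the reset marker (the marker convention is illustrative):

    val state = counts.updateStateByKey[Long] { (updates: Seq[Long], prev: Option[Long]) =>
      if (updates.contains(-1L)) Some(0L)               // reset message seen: start the sum over
      else Some(prev.getOrElse(0L) + updates.sum)       // otherwise keep summing
    }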

Re: Secondary Sorting in Spark

2015-10-26 Thread Adrian Tanase
shuffling… -adrian From: swetha kasireddy Date: Monday, October 26, 2015 at 6:56 AM To: Adrian Tanase Cc: Bill Bejeck, "user@spark.apache.org<mailto:user@spark.apache.org>" Subject: Re: Secondary Sorting in Spark Hi, Does the use of custom partitioner in Streaming affect performance

Re: [SPARK STREAMING] Concurrent operations in spark streaming

2015-10-26 Thread Adrian Tanase
have a particular concern or use case that relies on ordering between A, B and X2? -adrian From: Nipun Arora Date: Sunday, October 25, 2015 at 4:09 PM To: Andy Dang Cc: user Subject: Re: [SPARK STREAMING] Concurrent operations in spark streaming So essentially the driver/client program needs

Re: Accumulators internals and reliability

2015-10-26 Thread Adrian Tanase
ery makes no attempt to initialize them (See SPARK-5206<https://issues.apache.org/jira/browse/SPARK-5206?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22accumulator%20null%22>) Hope this helps, -adrian From: "Sela, Amit" Date: Monday, October 26, 2015 at 11:13 AM To: "user@spar

Re: [SPARK STREAMING] Concurrent operations in spark streaming

2015-10-26 Thread Adrian Tanase
Thinking more about it – it should only be 2 tasks as A and B are most likely collapsed by spark in a single task. Again – learn to use the spark UI as it’s really informative. The combination of DAG visualization and task count should answer most of your questions. -adrian From: Adrian

Re: Spark StreamingStatefull information

2015-10-22 Thread Adrian Tanase
The result of updatestatebykey is a dstream that emits the entire state every batch - as an RDD - nothing special about it. It's easy to join / cogroup with another RDD if you have the correct keys in both. You could load this one when the job starts and/or have it update with updatestatebykey as
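A minimal sketch of joining the state DStream with a reference RDD loaded at startup, assuming both are keyed by the same id (names and formats are illustrative):

    val profiles: org.apache.spark.rdd.RDD[(String, String)] =
      sc.textFile("hdfs:///profiles.csv").map { line =>
        val Array(id, name) = line.split(","); (id, name)
      }

    val enriched = stateDStream.transform { stateRdd =>
      stateRdd.join(profiles)                           // cogroup works the same way for non-matching keys
    }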

Re: Analyzing consecutive elements

2015-10-22 Thread Adrian Tanase
. -adrian From: Sampo Niskanen Date: Thursday, October 22, 2015 at 2:12 PM To: Adrian Tanase Cc: user Subject: Re: Analyzing consecutive elements Hi, Sorry, I'm not very familiar with those methods and cannot find the 'drop' method anywhere. As an example: val arr = Array((1, "A"

Re: Job splling to disk and memory in Spark Streaming

2015-10-21 Thread Adrian Tanase
, etc) to execute in a single stage. Hope this helps, -adrian From: Tathagata Das Date: Wednesday, October 21, 2015 at 10:36 AM To: swetha Cc: user Subject: Re: Job splling to disk and memory in Spark Streaming Well, reduceByKey needs to shutffle if your intermediate data is not already partitioned

Re: Spark on Yarn

2015-10-21 Thread Adrian Tanase
The question is whether the spark dependency is marked as provided or is included in the fat jar. For example, we are compiling the spark distro separately for java 8 + scala 2.11 + hadoop 2.6 (with maven) and marking it as provided in sbt. -adrian From: Raghuveer Chanda Date: Wednesday, October 21

Re: Whether Spark is appropriate for our use case.

2015-10-21 Thread Adrian Tanase
Can you share your approximate data size? all should be valid use cases for spark, wondering if you are providing enough resources. Also - do you have some expectations in terms of performance? what does "slow down" mean? For this usecase I would personally favor parquet over DB, and

Re: difference between rdd.collect().toMap to rdd.collectAsMap() ?

2015-10-20 Thread Adrian Tanase
/f85aa06464a10f5d1563302fd76465dded475a12/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L741-L753 -adrian On 10/20/15, 12:35 PM, "kali.tumm...@gmail.com" <kali.tumm...@gmail.com> wrote: >Hi All, > >Is there any performance impact when I use collectAsMap on my RD

Re: Partition for each executor

2015-10-20 Thread Adrian Tanase
I think it should use the default parallelism which by default is equal to the number of cores in your cluster. If you want to control it, specify a value for numSlices - the second param to parallelize(). -adrian On 10/20/15, 6:13 PM, "t3l" <t...@threelights.de> wr
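A minimal sketch (the slice count is illustrative):

    val rdd = sc.parallelize(1 to 1000000, numSlices = 64)   // overrides sc.defaultParallelism for this RDD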

Re: Should I convert json into parquet?

2015-10-19 Thread Adrian Tanase
about update performance then you probably need to look at a DB that offers random write access (Cassandra, Hbase..) -adrian On 10/19/15, 12:31 PM, "Ewan Leith" <ewan.le...@realitymine.com> wrote: >As Jörn says, Parquet and ORC will get you really good compression and c

Re: Spark Streaming - use the data in different jobs

2015-10-19 Thread Adrian Tanase
. Lastly, if there is no strong requirement to have different jobs, you might consider collapsing the 2 jobs into one, and simply have multiple stages that execute in the same job. -adrian From: Ewan Leith Date: Monday, October 19, 2015 at 12:34 PM To: Oded Maimon, user Subject: RE: Spark

Re: Spark handling parallel requests

2015-10-19 Thread Adrian Tanase
and examples * This may not be a good choice if you can’t afford to lose any messages – in this case your life is harder as you’ll need to also use WAL based implementation Hope this helps, -adrian From: "tarek.abouzei...@yahoo.com.INVALID<mailto:tarek.abouzei...@yahoo

Re: How does shuffle work in spark ?

2015-10-19 Thread Adrian Tanase
I don’t know why it expands to 50 GB but it’s correct to see it both on the first operation (shuffled write) and on the next one (shuffled read). It’s the barrier between the 2 stages. -adrian From: shahid ashraf Date: Monday, October 19, 2015 at 9:53 PM To: Kartik Mathur, Adrian Tanase Cc

Re: Spark Streaming scheduler delay VS driver.cores

2015-10-19 Thread Adrian Tanase
Bump on this question – does anyone know what is the effect of spark.driver.cores on the driver's ability to manage larger clusters? Any tips on setting a correct value? I’m running Spark streaming on Yarn / Hadoop 2.6 / Spark 1.5.1. Thanks, -adrian From: Adrian Tanase Date: Saturday, October

Re: Streaming of COAP Resources

2015-10-19 Thread Adrian Tanase
” Then you need to call .store() for every new value that you’re notified of. -adrian On 10/16/15, 9:38 AM, "Sadaf" <sa...@platalytics.com> wrote: >I am currently working on IOT Coap protocol.I accessed server on local host >through copper firefox plugin. Then i Add

Re: Differentiate Spark streaming in event logs

2015-10-19 Thread Adrian Tanase
You could try to start the 2/N jobs with a slightly different log4j template, by prepending some job type to all the messages... On 10/19/15, 9:47 PM, "franklyn" wrote: >Hi I'm running a job to collect some analytics on spark jobs by analyzing >their event logs. We

Re: How to calculate row by row and output results in Spark

2015-10-19 Thread Adrian Tanase
Are you by any chance looking for reduceByKey? If you’re trying to collapse all the values in V into an aggregate, that’s what you should be looking at. -adrian From: Ted Yu Date: Monday, October 19, 2015 at 9:16 PM To: Shepherd Cc: user Subject: Re: How to calculate row by row and output

Spark Streaming scheduler delay VS driver.cores

2015-10-17 Thread Adrian Tanase
parallelism factor for a streaming app with 10-20 secs batch time? I found it odd that at 7 x 22 = 154 the driver is becoming a bottleneck * I’ve seen people recommend 3-4 tasks/core or ~1000 parallelism for clusters in the tens of nodes Thanks in advance, -adrian

Re: repartition vs partitionby

2015-10-17 Thread Adrian Tanase
If the dataset allows it, you can try to write a custom partitioner to help spark distribute the data more uniformly. Sent from my iPhone On 17 Oct 2015, at 16:14, shahid ashraf > wrote: yes i know about that, it's in case to reduce partitions. the
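A minimal sketch of such a partitioner, assuming the keys carry an explicit salt component used to spread hot keys (the key shape is illustrative):

    import org.apache.spark.Partitioner

    class SaltedPartitioner(override val numPartitions: Int) extends Partitioner {
      override def getPartition(key: Any): Int = key match {
        case (salt: Int, _) => math.abs(salt % numPartitions)           // salted keys spread evenly
        case other          => math.abs(other.hashCode % numPartitions) // everything else: plain hashing
      }
    }
    val balanced = saltedPairs.partitionBy(new SaltedPartitioner(200))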

Re: Building with SBT and Scala 2.11

2015-10-14 Thread Adrian Tanase
You are correct, of course. Gave up on sbt for spark long ago, I never managed to get it working while mvn works great. Sent from my iPhone On 14 Oct 2015, at 16:52, Ted Yu <yuzhih...@gmail.com<mailto:yuzhih...@gmail.com>> wrote: Adrian: Likely you were using maven. Ja

Re: Building with SBT and Scala 2.11

2015-10-13 Thread Adrian Tanase
Do you mean hadoop-2.4 or 2.6? Not sure if this is the issue, but I'm also compiling the 1.5.1 version with scala 2.11 and hadoop 2.6 and it works. -adrian Sent from my iPhone On 14 Oct 2015, at 03:53, Jakob Odersky <joder...@gmail.com<mailto:joder...@gmail.com>> wrote: I'm ha

Re: SQLContext within foreachRDD

2015-10-12 Thread Adrian Tanase
/scala/com/databricks/apps/logs/chapter1/LogAnalyzerStreamingSQL.scala -adrian From: Daniel Haviv Date: Monday, October 12, 2015 at 12:52 PM To: user Subject: SQLContext within foreachRDD Hi, As things that run inside foreachRDD run at the driver, does that mean that if we use SQLContext inside

Re: "dynamically" sort a large collection?

2015-10-12 Thread Adrian Tanase
I think you’re looking for the flatMap (or flatMapValues) operator – you can do something like sortedRdd.flatMapValues( v => if (v % 2 == 0) { Some(v / 2) } else { None } ) Then you need to sort again. -adrian From: Yifan LI Date: Monday, October 12, 2015 at 1:03 PM To: spark users Subj

Re: Spark retrying task indefinietly

2015-10-12 Thread Adrian Tanase
and dirty way to do it is: val lines = messages.flatMapValues(v => Try(v.toInt).toOption) This way, only the lines that are successfully parsed are kept around. Read a bit on scala.util.{Try, Success, Failure} and Options to understand what’s going on. -adrian On 10/12/15, 9:05

Re: Streaming Performance w/ UpdateStateByKey

2015-10-10 Thread Adrian Tanase
metric in the spark UI is 0-1ms in both scenarios. Any idea where else to look or why am I not seeing any performance uplift? Thanks! Adrian Sent from my iPhone On 06 Oct 2015, at 00:47, Tathagata Das <t...@databricks.com<mailto:t...@databricks.com>> wrote: You could call DSt

Re: Notification on Spark Streaming job failure

2015-10-07 Thread Adrian Tanase
processed by the job over a period of time * This is coupled with a stable stream of data from a canary instance Hope this helps – feel free to google around for all the above buzzwords :). I can get into more details on demand. -adrian From: Chen Song Date: Monday, September 28, 2015 at 5:00

Re: Secondary Sorting in Spark

2015-10-05 Thread Adrian Tanase
Great article, especially the use of a custom partitioner. Also, sorting by multiple fields by creating a tuple out of them is an awesome, easy to miss, Scala feature. Sent from my iPhone On 04 Oct 2015, at 21:41, Bill Bejeck > wrote: I've written

Re: Broadcast var is null

2015-10-05 Thread Adrian Tanase
FYI the same happens with accumulators when recovering from checkpoint. I'd love to see this fixed somehow as the workaround (using a singleton factory in foreachRdd to make sure the accumulators are initialized instead of null) is really intrusive... Sent from my iPhone On 05 Oct 2015, at

Re: Usage of transform for code reuse between Streaming and Batch job affects the performance ?

2015-10-05 Thread Adrian Tanase
the partitioning (e.g. unioning 2 DStreams or a simple flatMap) you will be forced to use transform as the DStreams hide away some of the control. -adrian Sent from my iPhone > On 05 Oct 2015, at 03:59, swetha <swethakasire...@gmail.com> wrote: > > Hi, > > I have the following code f

Re: RDD of ImmutableList

2015-10-05 Thread Adrian Tanase
If you don't need to write data back using that library I'd say go for #2. Convert to a scala class and standard lists, should be easier down the line. That being said, you may end up writing custom code if you stick with kryo anyway... Sent from my iPhone On 05 Oct 2015, at 21:42, Jakub

Re: Limiting number of cores per job in multi-threaded driver.

2015-10-04 Thread Adrian Tanase
You are absolutely correct, I apologize. My understanding was that you are sharing the machine across many jobs. That was the context in which I was making that comment. -adrian Sent from my iPhone On 03 Oct 2015, at 07:03, Philip Weaver <philip.wea...@gmail.com<mailto:phil

Re: how to broadcast huge lookup table?

2015-10-04 Thread Adrian Tanase
have a look at .transformWith, you can specify another RDD. Sent from my iPhone On 02 Oct 2015, at 21:50, "saif.a.ell...@wellsfargo.com" > wrote: I tried broadcasting a key-value rdd, but

Re: Shuffle Write v/s Shuffle Read

2015-10-02 Thread Adrian Tanase
. -adrian From: Kartik Mathur Date: Thursday, October 1, 2015 at 10:36 PM To: user Subject: Shuffle Write v/s Shuffle Read Hi I am trying to better understand shuffle in spark . Based on my understanding thus far , Shuffle Write : writes stage output for intermediate stage on local disk

Re: Accumulator of rows?

2015-10-02 Thread Adrian Tanase
Have you seen window functions? https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html From: "saif.a.ell...@wellsfargo.com" Date: Thursday, October 1, 2015 at 9:44 PM To: "user@spark.apache.org"
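A minimal sketch of a window-function aggregation, assuming a DataFrame df with id, ts and amount columns (all names are illustrative):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val w = Window.partitionBy("id").orderBy("ts")
    val withRunningTotal = df.withColumn("running_total", sum("amount").over(w))   // per-id running sum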

Re: Kafka Direct Stream

2015-10-01 Thread Adrian Tanase
would also look at filtering by topic and saving as different Dstreams in your code. Either way you need to start with Cody’s tip in order to extract the topic name. -adrian From: Cody Koeninger Date: Thursday, October 1, 2015 at 5:06 PM To: Udit Mehta Cc: user Subject: Re: Kafka Direct Stream You
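A minimal sketch of that tip: tag each record with its topic via the offset ranges so the direct stream can be split per topic afterwards (Spark 1.x Kafka direct API; names are illustrative):

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    val tagged = directStream.transform { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges   // one range per RDD partition
      rdd.mapPartitionsWithIndex { (i, iter) =>
        iter.map(record => (ranges(i).topic, record))
      }
    }
    val topicA = tagged.filter(_._1 == "topicA").map(_._2)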

Re: Lost leader exception in Kafka Direct for Streaming

2015-10-01 Thread Adrian Tanase
This also happened to me in extreme recovery scenarios – e.g. Killing 4 out of a 7 machine cluster. I’d put my money on recovering from an out of sync replica, although I haven’t done extensive testing around it. -adrian From: Cody Koeninger Date: Thursday, October 1, 2015 at 5:18 PM

Re: automatic start of streaming job on failure on YARN

2015-10-01 Thread Adrian Tanase
. -adrian From: Jeetendra Gangele Date: Thursday, October 1, 2015 at 4:30 PM To: user Subject: automatic start of streaming job on failure on YARN We've a streaming application running on yarn and we would like to ensure that is up running 24/7. Is there a way to tell yarn to automatically restart

Re: Monitoring tools for spark streaming

2015-09-29 Thread Adrian Tanase
You can also use the REST api introduced in 1.4 – although it’s harder to parse: * jobs from the same batch are not grouped together * You only get total delay, not scheduling delay From: Hari Shreedharan Date: Tuesday, September 29, 2015 at 5:27 AM To: Shixiong Zhu Cc: Siva,

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Adrian Tanase
etc with your own set-up) Lastly, has this ever worked? Maybe you’ve accidentally created the topic with more partitions and replicas than available brokers… try to recreate with fewer partitions/replicas, see if it works. -adrian From: Dmitry Goldenberg Date: Tuesday, September 29, 2015 at 3

Re: Merging two avro RDD/DataFrames

2015-09-29 Thread Adrian Tanase
under a minute with the correct resources… -adrian From: TEST ONE Date: Tuesday, September 29, 2015 at 3:00 AM To: "user@spark.apache.org<mailto:user@spark.apache.org>" Subject: Merging two avro RDD/DataFrames I have a daily update of modified users (~100s) output as avro fro

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Adrian Tanase
I believe some of the brokers in your cluster died and there are a number of partitions that nobody is currently managing. -adrian From: Dmitry Goldenberg Date: Tuesday, September 29, 2015 at 3:26 PM To: "user@spark.apache.org<mailto:user@spark.apache.org>" Subject: Kafka e

Re: Converting a DStream to schemaRDD

2015-09-29 Thread Adrian Tanase
Also check this out https://github.com/databricks/reference-apps/blob/master/logs_analyzer/chapter1/scala/src/main/scala/com/databricks/apps/logs/chapter1/LogAnalyzerStreamingSQL.scala From the data bricks reference app: https://github.com/databricks/reference-apps From: Ewan Leith Date:

Re: Spark Streaming Log4j Inside Eclipse

2015-09-29 Thread Adrian Tanase
AVA Code Ashish On Mon, Sep 28, 2015 at 10:42 AM, Adrian Tanase <atan...@adobe.com<mailto:atan...@adobe.com>> wrote: You also need to provide it as parameter to spark submit http://stackoverflow.com/questions/28840438/how-to-override-sparks-log4j-properties-per-driver From: Ashish Soni D

Re: Adding / Removing worker nodes for Spark Streaming

2015-09-29 Thread Adrian Tanase
this is part of the checkpointed metadata in the spark context. -adrian From: Cody Koeninger Date: Tuesday, September 29, 2015 at 12:49 AM To: Sourabh Chandak Cc: Augustus Hong, "user@spark.apache.org<mailto:user@spark.apache.org>" Subject: Re: Adding / Removing worker nodes for

Re: Does YARN start new executor in place of the failed one?

2015-09-29 Thread Adrian Tanase
the issue if it’s the same use case :) Also – make sure that you have resources available in YARN, if the cluster is shared. -adrian From: Alexander Pivovarov Date: Tuesday, September 29, 2015 at 1:38 AM To: "user@spark.apache.org<mailto:user@spark.apache.org>" Subject: Does YARN st

Re: Spark Streaming Log4j Inside Eclipse

2015-09-28 Thread Adrian Tanase
You also need to provide it as parameter to spark submit http://stackoverflow.com/questions/28840438/how-to-override-sparks-log4j-properties-per-driver From: Ashish Soni Date: Monday, September 28, 2015 at 5:18 PM To: user Subject: Spark Streaming Log4j Inside Eclipse I need to turn off the

Re: Receiver and Parallelization

2015-09-25 Thread Adrian Tanase
Good catch, I was not aware of this setting. I’m wondering though if it also generates a shuffle or if the data is still processed by the node on which it’s ingested - so that you’re not gated by the number of cores on one machine. -adrian On 9/25/15, 5:27 PM, "Silvio Fiorito" &l

Re: Reasonable performance numbers?

2015-09-25 Thread Adrian Tanase
you redistribute the data evenly across all nodes/cores and avoid most of the issues above – at least you have a guarantee that the load is spread evenly across the cluster. Hope this helps, -adrian From: "Young, Matthew T" Date: Thursday, September 24, 2015 at 11:47 PM

Re: kafka direct streaming with checkpointing

2015-09-25 Thread Adrian Tanase
: * you’re not guaranteed to shut down gracefully * You may have a bug that prevents the state from being saved and you can’t restart the app w/o upgrade Less than ideal, yes :) -adrian From: Radu Brumariu Date: Friday, September 25, 2015 at 1:31 AM To: Cody Koeninger Cc: "user@spark.apach

Re: Using Spark for portfolio manager app

2015-09-25 Thread Adrian Tanase
code for state objects w/o updates) is to clean up users if they haven’t received updates for a long time, then load the state from DB the next time you see them. I would consider this a must-have optimization to keep some bounds on the memory needs. -adrian From: Thúy Hằng Lê Date: Friday, September 25

Re: Using Spark for portfolio manager app

2015-09-25 Thread Adrian Tanase
messages” - passing info as a 1 time message downstream: * In your updateStateByKey function, emit a tuple of (actualNewState, changedData) * Then filter this on !changedData.isEmpty or something * And only do foreachRdd on the filtered stream. Makes sense? -adrian From: Thúy Hằng Lê Date
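A minimal sketch of that pattern, assuming a DStream[(String, Double)] of price updates per symbol named updates (the state shape and sink are illustrative):

    // state = (lastPrice, changedThisBatch)
    val state = updates.updateStateByKey[(Double, Boolean)] { (batch: Seq[Double], prev: Option[(Double, Boolean)]) =>
      if (batch.isEmpty) prev.map { case (p, _) => (p, false) }   // nothing new: keep state, clear the flag
      else Some((batch.last, true))                               // new data: remember it and flag the change
    }

    state
      .filter { case (_, (_, changed)) => changed }               // keep only keys that actually changed
      .map { case (symbol, (price, _)) => (symbol, price) }
      .foreachRDD { rdd => rdd.foreach(println) }                 // stand-in for the real downstream sink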
