Re: UDAFs for sketching Dataset columns with T-Digests
This is interesting and very useful. Thanks.

On Thu, Jul 6, 2017 at 2:33 AM, Erik Erlandson wrote:
> After my talk on T-Digests in Spark at Spark Summit East, there were some
> requests for a UDAF-based interface for working with Datasets. I'm
> pleased to announce that I released a library for doing T-Digest sketching
> with UDAFs:
>
> https://github.com/isarn/isarn-sketches-spark
>
> This initial release provides support for Scala. Future releases will
> support PySpark bindings, and additional tools for leveraging T-Digests in
> ML pipelines.
>
> Cheers!
> Erik
Re: Spark join and large temp files
Have you tried broadcasting your small table in order to perform your join?

joined = bigDF.join(broadcast(smallDF), "id")

On Tue, Aug 9, 2016 at 3:29 PM, Ashic Mahtab wrote:
> Hi Deepak,
> No... not really. Upping the disk size is a solution, but more expensive, as
> you can't easily attach EBS volumes to EMR clusters configured with data
> pipelines (which is what we're doing). I've tried collecting the 1.5G dataset
> in a hashmap and broadcasting. Timeouts seem to prevent that (even after
> upping the max driver result size). Increasing partition counts didn't help
> (the shuffle used up the temp space). I'm now looking at some form of
> clever broadcasting, or maybe falling back to chunking up the input,
> producing interim output, and unioning them for the final output. Might
> even try using Spark Streaming pointing at the parquet and seeing if that
> helps.
>
> -Ashic.
>
> --
> From: deepakmc...@gmail.com
> Date: Tue, 9 Aug 2016 17:31:19 +0530
> Subject: Re: Spark join and large temp files
> To: as...@live.com
>
> Hi Ashic
> Did you find the resolution to this issue?
> Just curious to know what helped in this scenario.
>
> Thanks
> Deepak
>
>
> On Tue, Aug 9, 2016 at 12:23 AM, Ashic Mahtab wrote:
>
> Hi Deepak,
> Thanks for the response.
>
> Registering the temp tables didn't help. Here's what I have:
>
> val a = sqlContext.read.parquet(...).select("eid.id",
>   "name").withColumnRenamed("eid.id", "id")
> val b = sqlContext.read.parquet(...).select("id", "number")
>
> a.registerTempTable("a")
> b.registerTempTable("b")
>
> val results = sqlContext.sql("SELECT x.id, x.name, y.number FROM a x JOIN
>   b y ON x.id = y.id")
>
> results.write.parquet(...)
>
> Is there something I'm missing?
>
> Cheers,
> Ashic.
>
> --
> From: deepakmc...@gmail.com
> Date: Tue, 9 Aug 2016 00:01:32 +0530
> Subject: Re: Spark join and large temp files
> To: as...@live.com
> CC: user@spark.apache.org
>
> Register your dataframes as temp tables and then try the join on the temp
> tables.
> This should resolve your issue.
>
> Thanks
> Deepak
>
> On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab wrote:
>
> Hello,
> We have two parquet inputs of the following form:
>
> a: id: String, Name: String (1.5TB)
> b: id: String, Number: Int (1.3GB)
>
> We need to join these two to get (id, Number, Name). We've tried two
> approaches:
>
> a.join(b, Seq("id"), "right_outer")
>
> where a and b are dataframes. We also tried taking the RDDs, mapping them
> to pair RDDs with id as the key, and then joining. What we're seeing is
> that temp file usage is increasing on the join stage, filling up our
> disks and causing the job to crash. Is there a way to join these two
> datasets without, well... crashing?
>
> Note: the ids are unique, and there's a one-to-one mapping between the two
> datasets.
>
> Any help would be appreciated.
>
> -Ashic.
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
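The broadcast idea discussed above can be illustrated outside Spark. Below is a minimal pure-Python sketch of a map-side ("broadcast") join: the small table is collected into an in-memory dict and the big side is streamed against it, so the big table is never shuffled. The function name and sample rows are made up for illustration, not taken from the thread.

```python
def broadcast_join(big_rows, small_rows):
    # Build a lookup from the small side (analogous to broadcasting it): id -> number
    lookup = {row["id"]: row["number"] for row in small_rows}
    # Stream over the big side without shuffling it; emit only matching ids
    for row in big_rows:
        number = lookup.get(row["id"])
        if number is not None:
            yield {"id": row["id"], "name": row["name"], "number": number}

big = [{"id": "a1", "name": "alice"}, {"id": "b2", "name": "bob"}]
small = [{"id": "a1", "number": 7}]

print(list(broadcast_join(big, small)))
# → [{'id': 'a1', 'name': 'alice', 'number': 7}]
```

In Spark, `bigDF.join(broadcast(smallDF), "id")` expresses the same plan: the `broadcast` hint tells the planner to replicate the small side to every executor instead of shuffling both inputs, which avoids exactly the temp-file blow-up described in this thread (provided the small side fits in executor memory).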
Re: hdfs-ha on mesos - odd bug
I don't know about the broken URL. But are you running HDFS as a Mesos framework? If so, is it using mesos-dns? Then you should resolve the namenode via hdfs:///

On Mon, Sep 14, 2015 at 3:55 PM, Adrian Bridgett wrote:
> I'm hitting an odd issue with running Spark on Mesos together with
> HA-HDFS, with an even odder workaround.
>
> In particular, I get an error that it can't find the HDFS nameservice
> unless I put in a _broken_ URL (discovered that workaround by mistake!).
> core-site.xml and hdfs-site.xml are distributed to the slave node, and that
> file is being read, since if I deliberately break it I get an error as
> you'd expect.
>
> NB: This is a bit different to
> http://mail-archives.us.apache.org/mod_mbox/spark-user/201402.mbox/%3c1392442185079-1549.p...@n3.nabble.com%3E
>
> Spark 1.5.0:
>
> t=sc.textFile("hdfs://nameservice1/tmp/issue")
> t.count()
> (fails)
>
> t=sc.textFile("file://etc/passwd")
> t.count()
> (errors about bad url - should have an extra / of course)
> t=sc.textFile("hdfs://nameservice1/tmp/issue")
> t.count()
> then it works!!!
>
> I should say that using file:///etc/passwd or hdfs:///tmp/issue both fail
> as well, unless preceded by a broken URL. I've tried setting
> spark.hadoop.cloneConf to true; no change.
> > Sample (broken) run: > 15/09/14 13:00:14 DEBUG HadoopRDD: Creating new JobConf and caching it for > later re-use > 15/09/14 13:00:14 DEBUG : address: ip-10-1-200-165/10.1.200.165 > isLoopbackAddress: false, with host 10.1.200.165 ip-10-1-200-165 > 15/09/14 13:00:14 DEBUG BlockReaderLocal: > dfs.client.use.legacy.blockreader.local = false > 15/09/14 13:00:14 DEBUG BlockReaderLocal: dfs.client.read.shortcircuit = > false > 15/09/14 13:00:14 DEBUG BlockReaderLocal: > dfs.client.domain.socket.data.traffic = false > 15/09/14 13:00:14 DEBUG BlockReaderLocal: dfs.domain.socket.path = > /var/run/hdfs-sockets/dn > 15/09/14 13:00:14 DEBUG HAUtil: No HA service delegation token found for > logical URI hdfs://nameservice1 > 15/09/14 13:00:14 DEBUG BlockReaderLocal: > dfs.client.use.legacy.blockreader.local = false > 15/09/14 13:00:14 DEBUG BlockReaderLocal: dfs.client.read.shortcircuit = > false > 15/09/14 13:00:14 DEBUG BlockReaderLocal: > dfs.client.domain.socket.data.traffic = false > 15/09/14 13:00:14 DEBUG BlockReaderLocal: dfs.domain.socket.path = > /var/run/hdfs-sockets/dn > 15/09/14 13:00:14 DEBUG RetryUtils: multipleLinearRandomRetry = null > 15/09/14 13:00:14 DEBUG Server: rpcKind=RPC_PROTOCOL_BUFFER, > rpcRequestWrapperClass=class > org.apache.hadoop.ipc.ProtobufRpcEngine$RpcRequestWrapper, > rpcInvoker=org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker@6245f50b > 15/09/14 13:00:14 DEBUG Client: getting client out of cache: > org.apache.hadoop.ipc.Client@267f0fd3 > 15/09/14 13:00:14 DEBUG NativeCodeLoader: Trying to load the custom-built > native-hadoop library... > 15/09/14 13:00:14 DEBUG NativeCodeLoader: Loaded the native-hadoop library > ... 
> 15/09/14 13:00:14 DEBUG Client: Connecting to > mesos-1.example.com/10.1.200.165:8020 > 15/09/14 13:00:14 DEBUG Client: IPC Client (1739425103) connection to > mesos-1.example.com/10.1.200.165:8020 from ubuntu: starting, having > connections 1 > 15/09/14 13:00:14 DEBUG Client: IPC Client (1739425103) connection to > mesos-1.example.com/10.1.200.165:8020 from ubuntu sending #0 > 15/09/14 13:00:14 DEBUG Client: IPC Client (1739425103) connection to > mesos-1.example.com/10.1.200.165:8020 from ubuntu got value #0 > 15/09/14 13:00:14 DEBUG ProtobufRpcEngine: Call: getFileInfo took 36ms > 15/09/14 13:00:14 DEBUG FileInputFormat: Time taken to get FileStatuses: 69 > 15/09/14 13:00:14 INFO FileInputFormat: Total input paths to process : 1 > 15/09/14 13:00:14 DEBUG Client: IPC Client (1739425103) connection to > mesos-1.example.com/10.1.200.165:8020 from ubuntu sending #1 > 15/09/14 13:00:14 DEBUG Client: IPC Client (1739425103) connection to > mesos-1.example.com/10.1.200.165:8020 from ubuntu got value #1 > 15/09/14 13:00:14 DEBUG ProtobufRpcEngine: Call: getBlockLocations took 1ms > 15/09/14 13:00:14 DEBUG FileInputFormat: Total # of splits generated by > getSplits: 2, TimeTaken: 104 > ... 
> 15/09/14 13:00:24 DEBUG Client: IPC Client (1739425103) connection to > mesos-1.example.com/10.1.200.165:8020 from ubuntu: closed > 15/09/14 13:00:24 DEBUG Client: IPC Client (1739425103) connection to > mesos-1.example.com/10.1.200.165:8020 from ubuntu: stopped, remaining > connections 0 > 15/09/14 13:00:24 DEBUG > AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received > message > AkkaMessage(ExecutorRemoved(20150826-133446-3217621258-5050-4064-S1),true) > from Actor[akka://sparkDriver/temp/$g] > 15/09/14 13:00:24 DEBUG > AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: Received RPC message: > AkkaMessage(ExecutorRemoved(20150826-133446-3217621258-5050-4064-S1),true) > 15/09/14 13:00:24 DEBUG > AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled > message
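For reference, resolving a logical URI like hdfs://nameservice1/ depends entirely on client-side HA settings in hdfs-site.xml. A minimal sketch of what the client needs is below; the nameservice name matches the log output above, but the second namenode host (mesos-2.example.com) is an assumption, since only mesos-1.example.com appears in the log.

```xml
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>nameservice1</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.nameservice1</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.nameservice1.nn1</name>
    <value>mesos-1.example.com:8020</value>
  </property>
  <property>
    <!-- second namenode host is assumed for illustration -->
    <name>dfs.namenode.rpc-address.nameservice1.nn2</name>
    <value>mesos-2.example.com:8020</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.nameservice1</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
</configuration>
```

If any of these keys are missing on the node where the driver or executor runs, the client cannot map "nameservice1" to a physical namenode, which is consistent with the "can't find the HDFS nameservice" symptom reported here.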
Re: Spark (Streaming?) holding on to Mesos resources
Hi Gerard, isn't this the same issue as this? https://issues.apache.org/jira/browse/MESOS-1688

On Mon, Jan 26, 2015 at 9:17 PM, Gerard Maas gerard.m...@gmail.com wrote:

Hi,

We are observing with certain regularity that our Spark jobs, as Mesos frameworks, are hoarding resources and not releasing them, resulting in resource starvation for all jobs running on the Mesos cluster.

For example, this is a job that has spark.cores.max=4 and spark.executor.memory=3g:

ID                    Framework      Host                   CPUs  Mem
…5050-16506-1146497   FooStreaming   dnode-4.hdfs.private   7     13.4 GB
…5050-16506-1146495   FooStreaming   dnode-0.hdfs.private   1     6.4 GB
…5050-16506-1146491   FooStreaming   dnode-5.hdfs.private   7     11.9 GB
…5050-16506-1146449   FooStreaming   dnode-3.hdfs.private   7     4.9 GB
…5050-16506-1146247   FooStreaming   dnode-1.hdfs.private   0.5   5.9 GB
…5050-16506-1146226   FooStreaming   dnode-2.hdfs.private   3     7.9 GB
…5050-16506-1144069   FooStreaming   dnode-3.hdfs.private   1     8.7 GB
…5050-16506-1133091   FooStreaming   dnode-5.hdfs.private   1     1.7 GB
…5050-16506-1133090   FooStreaming   dnode-2.hdfs.private   5     5.2 GB
…5050-16506-1133089   FooStreaming   dnode-1.hdfs.private   6.5   6.3 GB
…5050-16506-1133088   FooStreaming   dnode-4.hdfs.private   1     251 MB
…5050-16506-1133087   FooStreaming   dnode-0.hdfs.private   6.4   6.8 GB

The only way to release the resources is by manually finding the process in the cluster and killing it. The jobs are often streaming, but batch jobs show this behavior as well. We have more streaming jobs than batch, so the stats are biased. Any ideas of what's up here? Hopefully some very bad ugly bug that has been fixed already and that will urge us to upgrade our infra?

Mesos 0.20 + Marathon 0.7.4 + Spark 1.1.0

-kr, Gerard.
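For context, the resource cap described above corresponds to a spark-defaults.conf along these lines. The two limits are quoted from the thread; the spark.mesos.coarse line is an assumption added for illustration, since the fine-grained vs. coarse-grained Mesos mode is the usual variable in resource-release behavior on Spark 1.1-era clusters, not something the thread confirms.

```
# conf/spark-defaults.conf (fragment)
spark.cores.max        4
spark.executor.memory  3g
# Assumption, not from the thread: coarse-grained mode acquires a fixed
# set of resources up front instead of per-task fine-grained offers.
spark.mesos.coarse     true
```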
Re: Spark is slow
Why don't you start by explaining what kind of operation you're running on Spark that's slower than Hadoop MapReduce? Maybe we could start there. And yes, this mailing list is very busy; since many people are getting into Spark, it's hard to answer everyone.

On 21 Apr 2014 20:23, Joe L selme...@yahoo.com wrote:

It is claimed that Spark is 10x or 100x faster than MapReduce and Hive, but since I started using it I haven't seen any faster performance. It is taking 2 minutes to run map and join tasks over just 2GB of data, whereas Hive was taking just a few seconds to join 2 tables over the same data. And I haven't gotten any answers to my questions. I don't understand the purpose of this group, and there is not enough documentation on Spark and its usage.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-is-slow-tp4539.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: [ann] Spark-NYC Meetup
Sounds great, François.

On 21 Apr 2014 22:31, François Le Lay f...@spotify.com wrote:

Hi everyone,

This is a quick email to announce the creation of a Spark-NYC Meetup. We have 2 upcoming events: one at PlaceIQ, another at Spotify, where Reynold Xin (Databricks) and Christopher Johnson (Spotify) have talks scheduled.

More info: http://www.meetup.com/Spark-NYC/

--
François Le Lay
Data Infra Chapter Lead
Spotify NYC
Twitter: @lelayf