Re: UDAFs for sketching Dataset columns with T-Digests

2017-07-06 Thread Sam Bessalah
This is interesting and very useful.
Thanks.

On Thu, Jul 6, 2017 at 2:33 AM, Erik Erlandson  wrote:

> After my talk on T-Digests in Spark at Spark Summit East, there were some
> requests for a UDAF-based interface for working with Datasets.   I'm
> pleased to announce that I released a library for doing T-Digest sketching
> with UDAFs:
>
> https://github.com/isarn/isarn-sketches-spark
>
> This initial release provides support for Scala. Future releases will
> support PySpark bindings, and additional tools for leveraging T-Digests in
> ML pipelines.
>
> Cheers!
> Erik
>
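
To give a feel for what this enables, here is a hypothetical Scala sketch of
sketching a Dataset column with such a UDAF and querying approximate quantiles.
The package paths and names below (tdigestUDAF, TDigestSQL, cdfInverse) are
assumptions about the library's interface, not confirmed signatures; the project
README is the authoritative reference.

  // Assumes a SparkSession `spark` and the isarn-sketches-spark artifact on the classpath.
  // Package paths and type/method names are assumed, not confirmed.
  import org.isarnproject.sketches.udaf._              // assumed UDAF package
  import org.apache.spark.isarnproject.sketches.udt._  // assumed UDT package
  import org.apache.spark.sql.functions.col

  val df = spark.range(100000).selectExpr("cast(id as double) as latency")  // toy data
  val td = df.agg(tdigestUDAF[Double](col("latency")))   // assumed UDAF factory for Double columns
    .first.getAs[TDigestSQL](0).tdigest                  // assumed wrapper type around the sketch
  val median = td.cdfInverse(0.5)   // assumed inverse-CDF method for approximate quantiles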


Re: Spark join and large temp files

2016-08-09 Thread Sam Bessalah
Have you tried broadcasting your small table in order to perform your
join?

val joined = bigDF.join(broadcast(smallDF), Seq("id"))
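
(A note on the snippet above: broadcast comes from org.apache.spark.sql.functions,
and Spark only broadcasts automatically for tables below
spark.sql.autoBroadcastJoinThreshold, 10 MB by default, so a ~1.3 GB table needs
the explicit hint plus enough memory on the driver and each executor to hold it.)

  import org.apache.spark.sql.functions.broadcast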


On Tue, Aug 9, 2016 at 3:29 PM, Ashic Mahtab  wrote:

> Hi Deepak,
> No... not really. Upping the disk size is a solution, but it's more expensive,
> and you can't easily attach EBS volumes to EMR clusters configured with data
> pipelines (which is what we're doing). I've tried collecting the 1.5G dataset
> in a hashmap and broadcasting it. Timeouts seem to prevent that (even after
> upping the max driver result size). Increasing partition counts didn't help
> (the shuffle used up the temp space). I'm now looking at some form of
> clever broadcasting, or maybe falling back to chunking up the input,
> producing interim output, and unioning them for the final output. Might
> even try using Spark Streaming pointing to the parquet and seeing if that
> helps.
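
To make the chunking fallback concrete, here is a hedged Scala sketch of the
"chunk, join, union" approach described above. The paths, the chunk count, and
the use of crc32 to bucket ids are illustrative rather than from the thread, and
the obvious cost is that every chunk re-scans the large input:

  import org.apache.spark.sql.functions._

  val numChunks = 16
  val big   = sqlContext.read.parquet("/path/to/big")    // id, name   (~1.5 TB)
  val small = sqlContext.read.parquet("/path/to/small")  // id, number (~1.3 GB)

  (0 until numChunks).foreach { i =>
    // keep only ids whose checksum falls in this bucket, join, write an interim result;
    // each chunk's join shuffles (and spills) far less data at a time
    val bigSlice   = big.filter(crc32(col("id")) % numChunks === i)
    val smallSlice = small.filter(crc32(col("id")) % numChunks === i)
    bigSlice.join(smallSlice, Seq("id")).write.parquet(s"/path/to/interim/chunk=$i")
  }

  // union the interim outputs into the final result
  sqlContext.read.parquet("/path/to/interim").write.parquet("/path/to/final")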
>
> -Ashic.
>
> --
> From: deepakmc...@gmail.com
> Date: Tue, 9 Aug 2016 17:31:19 +0530
> Subject: Re: Spark join and large temp files
> To: as...@live.com
>
> Hi Ashic
> Did you find the resolution to this issue?
> Just curious to know like what helped in this scenario.
>
> Thanks
> Deepak
>
>
> On Tue, Aug 9, 2016 at 12:23 AM, Ashic Mahtab  wrote:
>
> Hi Deepak,
> Thanks for the response.
>
> Registering the temp tables didn't help. Here's what I have:
>
> val a = sqlContext.read.parquet(...).select("eid.id",
> "name").withColumnRenamed("eid.id", "id")
> val b = sqlContext.read.parquet(...).select("id", "number")
>
> a.registerTempTable("a")
> b.registerTempTable("b")
>
> val results = sqlContext.sql("SELECT x.id, x.name, y.number FROM a x join
> b y on x.id=y.id")
>
> results.write.parquet(...)
>
> Is there something I'm missing?
>
> Cheers,
> Ashic.
>
> --
> From: deepakmc...@gmail.com
> Date: Tue, 9 Aug 2016 00:01:32 +0530
> Subject: Re: Spark join and large temp files
> To: as...@live.com
> CC: user@spark.apache.org
>
>
> Register your dataframes as temp tables and then try the join on the temp
> table.
> This should resolve your issue.
>
> Thanks
> Deepak
>
> On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab  wrote:
>
> Hello,
> We have two parquet inputs of the following form:
>
> a: id:String, Name:String  (1.5TB)
> b: id:String, Number:Int  (1.3GB)
>
> We need to join these two to get (id, Number, Name). We've tried two
> approaches:
>
> a.join(b, Seq("id"), "right_outer")
>
> where a and b are dataframes. We also tried taking the rdds, mapping them
> to pair rdds with id as the key, and then joining. What we're seeing is
> that temp file usage is increasing on the join stage, and filling up our
> disks, causing the job to crash. Is there a way to join these two data sets
> without well...crashing?
>
> Note, the ids are unique, and there's a one to one mapping between the two
> datasets.
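
Not from this thread, but as a generic mitigation: the spill from a shuffle of
this size lands under spark.local.dir, so pointing that at a volume with more
space and raising the shuffle partition count (so individual spill files stay
smaller) sometimes helps. A hedged sketch, with a placeholder path:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.local.dir", "/mnt/big-volume/spark-tmp")  // placeholder: where shuffle spill files go
    .set("spark.sql.shuffle.partitions", "2000")          // more, smaller shuffle partitions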
>
> Any help would be appreciated.
>
> -Ashic.
>
>
>
>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>


Re: hdfs-ha on mesos - odd bug

2015-09-14 Thread Sam Bessalah
I don't know about the broken url. But are you running HDFS as a Mesos
framework? If so, is it using mesos-dns?
Then you should resolve the namenode via hdfs:/// 

On Mon, Sep 14, 2015 at 3:55 PM, Adrian Bridgett 
wrote:

> I'm hitting an odd issue with running spark on mesos together with
> HA-HDFS, with an even odder workaround.
>
> In particular I get an error that it can't find the HDFS nameservice
> unless I put in a _broken_ url (discovered that workaround by mistake!).
> core-site.xml and hdfs-site.xml are distributed to the slave node - and those
> files are being read: if I deliberately break them, I get an error as
> you'd expect.
>
> NB: This is a bit different to
> http://mail-archives.us.apache.org/mod_mbox/spark-user/201402.mbox/%3c1392442185079-1549.p...@n3.nabble.com%3E
>
>
> Spark 1.5.0:
>
> t=sc.textFile("hdfs://nameservice1/tmp/issue")
> t.count()
> (fails)
>
> t=sc.textFile("file://etc/passwd")
> t.count()
> (errors about bad url - should have an extra / of course)
> t=sc.textFile("hdfs://nameservice1/tmp/issue")
> t.count()
> then it works!!!
>
> I should say that using file:///etc/passwd or hdfs:///tmp/issue both fail
> as well, unless preceded by a broken url. I've tried setting
> spark.hadoop.cloneConf to true, with no change.
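
Not a confirmed fix from this thread, but one hedged thing to check is whether
the HA client settings are actually visible to the executors. Spark forwards any
spark.hadoop.* properties into the Hadoop Configuration, so the nameservice can
be defined explicitly rather than relying on hdfs-site.xml being picked up on
every slave. The property names below are the standard HDFS HA client settings;
the second namenode host is a made-up placeholder:

  import org.apache.spark.SparkConf

  // mesos-2.example.com is hypothetical; only mesos-1.example.com appears in the logs below
  val conf = new SparkConf()
    .set("spark.hadoop.dfs.nameservices", "nameservice1")
    .set("spark.hadoop.dfs.ha.namenodes.nameservice1", "nn1,nn2")
    .set("spark.hadoop.dfs.namenode.rpc-address.nameservice1.nn1", "mesos-1.example.com:8020")
    .set("spark.hadoop.dfs.namenode.rpc-address.nameservice1.nn2", "mesos-2.example.com:8020")
    .set("spark.hadoop.dfs.client.failover.proxy.provider.nameservice1",
      "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")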
>
> Sample (broken) run:
> 15/09/14 13:00:14 DEBUG HadoopRDD: Creating new JobConf and caching it for
> later re-use
> 15/09/14 13:00:14 DEBUG : address: ip-10-1-200-165/10.1.200.165
> isLoopbackAddress: false, with host 10.1.200.165 ip-10-1-200-165
> 15/09/14 13:00:14 DEBUG BlockReaderLocal:
> dfs.client.use.legacy.blockreader.local = false
> 15/09/14 13:00:14 DEBUG BlockReaderLocal: dfs.client.read.shortcircuit =
> false
> 15/09/14 13:00:14 DEBUG BlockReaderLocal:
> dfs.client.domain.socket.data.traffic = false
> 15/09/14 13:00:14 DEBUG BlockReaderLocal: dfs.domain.socket.path =
> /var/run/hdfs-sockets/dn
> 15/09/14 13:00:14 DEBUG HAUtil: No HA service delegation token found for
> logical URI hdfs://nameservice1
> 15/09/14 13:00:14 DEBUG BlockReaderLocal:
> dfs.client.use.legacy.blockreader.local = false
> 15/09/14 13:00:14 DEBUG BlockReaderLocal: dfs.client.read.shortcircuit =
> false
> 15/09/14 13:00:14 DEBUG BlockReaderLocal:
> dfs.client.domain.socket.data.traffic = false
> 15/09/14 13:00:14 DEBUG BlockReaderLocal: dfs.domain.socket.path =
> /var/run/hdfs-sockets/dn
> 15/09/14 13:00:14 DEBUG RetryUtils: multipleLinearRandomRetry = null
> 15/09/14 13:00:14 DEBUG Server: rpcKind=RPC_PROTOCOL_BUFFER,
> rpcRequestWrapperClass=class
> org.apache.hadoop.ipc.ProtobufRpcEngine$RpcRequestWrapper,
> rpcInvoker=org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker@6245f50b
> 15/09/14 13:00:14 DEBUG Client: getting client out of cache:
> org.apache.hadoop.ipc.Client@267f0fd3
> 15/09/14 13:00:14 DEBUG NativeCodeLoader: Trying to load the custom-built
> native-hadoop library...
> 15/09/14 13:00:14 DEBUG NativeCodeLoader: Loaded the native-hadoop library
> ...
> 15/09/14 13:00:14 DEBUG Client: Connecting to
> mesos-1.example.com/10.1.200.165:8020
> 15/09/14 13:00:14 DEBUG Client: IPC Client (1739425103) connection to
> mesos-1.example.com/10.1.200.165:8020 from ubuntu: starting, having
> connections 1
> 15/09/14 13:00:14 DEBUG Client: IPC Client (1739425103) connection to
> mesos-1.example.com/10.1.200.165:8020 from ubuntu sending #0
> 15/09/14 13:00:14 DEBUG Client: IPC Client (1739425103) connection to
> mesos-1.example.com/10.1.200.165:8020 from ubuntu got value #0
> 15/09/14 13:00:14 DEBUG ProtobufRpcEngine: Call: getFileInfo took 36ms
> 15/09/14 13:00:14 DEBUG FileInputFormat: Time taken to get FileStatuses: 69
> 15/09/14 13:00:14 INFO FileInputFormat: Total input paths to process : 1
> 15/09/14 13:00:14 DEBUG Client: IPC Client (1739425103) connection to
> mesos-1.example.com/10.1.200.165:8020 from ubuntu sending #1
> 15/09/14 13:00:14 DEBUG Client: IPC Client (1739425103) connection to
> mesos-1.example.com/10.1.200.165:8020 from ubuntu got value #1
> 15/09/14 13:00:14 DEBUG ProtobufRpcEngine: Call: getBlockLocations took 1ms
> 15/09/14 13:00:14 DEBUG FileInputFormat: Total # of splits generated by
> getSplits: 2, TimeTaken: 104
> ...
> 15/09/14 13:00:24 DEBUG Client: IPC Client (1739425103) connection to
> mesos-1.example.com/10.1.200.165:8020 from ubuntu: closed
> 15/09/14 13:00:24 DEBUG Client: IPC Client (1739425103) connection to
> mesos-1.example.com/10.1.200.165:8020 from ubuntu: stopped, remaining
> connections 0
> 15/09/14 13:00:24 DEBUG
> AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received
> message
> AkkaMessage(ExecutorRemoved(20150826-133446-3217621258-5050-4064-S1),true)
> from Actor[akka://sparkDriver/temp/$g]
> 15/09/14 13:00:24 DEBUG
> AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: Received RPC message:
> AkkaMessage(ExecutorRemoved(20150826-133446-3217621258-5050-4064-S1),true)
> 15/09/14 13:00:24 DEBUG
> AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled
> message 

Re: Spark (Streaming?) holding on to Mesos resources

2015-01-27 Thread Sam Bessalah
Hi Gerard,
isn't this the same issue as this one?
https://issues.apache.org/jira/browse/MESOS-1688

On Mon, Jan 26, 2015 at 9:17 PM, Gerard Maas gerard.m...@gmail.com wrote:

 Hi,

 We are observing with some regularity that our Spark jobs, running as a Mesos
 framework, are hoarding resources and not releasing them, resulting in
 resource starvation to all jobs running on the Mesos cluster.

 For example:
 This is a job that has spark.cores.max = 4 and spark.executor.memory=3g

 ID                    Framework     Host                  CPUs  Mem
 …5050-16506-1146497   FooStreaming  dnode-4.hdfs.private  7     13.4 GB
 …5050-16506-1146495   FooStreaming  dnode-0.hdfs.private  1     6.4 GB
 …5050-16506-1146491   FooStreaming  dnode-5.hdfs.private  7     11.9 GB
 …5050-16506-1146449   FooStreaming  dnode-3.hdfs.private  7     4.9 GB
 …5050-16506-1146247   FooStreaming  dnode-1.hdfs.private  0.5   5.9 GB
 …5050-16506-1146226   FooStreaming  dnode-2.hdfs.private  3     7.9 GB
 …5050-16506-1144069   FooStreaming  dnode-3.hdfs.private  1     8.7 GB
 …5050-16506-1133091   FooStreaming  dnode-5.hdfs.private  1     1.7 GB
 …5050-16506-1133090   FooStreaming  dnode-2.hdfs.private  5     5.2 GB
 …5050-16506-1133089   FooStreaming  dnode-1.hdfs.private  6.5   6.3 GB
 …5050-16506-1133088   FooStreaming  dnode-4.hdfs.private  1     251 MB
 …5050-16506-1133087   FooStreaming  dnode-0.hdfs.private  6.4   6.8 GB
 The only way to release the resources is by manually finding the process
 in the cluster and killing it. The jobs are often streaming, but batch
 jobs also show this behavior. We have more streaming jobs than batch, so stats
 are biased.
 Any idea what's going on here? Hopefully it's some bad, ugly bug that has
 been fixed already and will urge us to upgrade our infra.

 Mesos 0.20 +  Marathon 0.7.4 + Spark 1.1.0

 -kr, Gerard.



Re: Spark is slow

2014-04-21 Thread Sam Bessalah
Why don't you start by explaining what kind of operation you're running on
Spark that's slower than Hadoop MapReduce? Maybe we could start there. And yes,
this mailing list is very busy since many people are getting into Spark; it's
hard to answer everyone.
On 21 Apr 2014 20:23, Joe L selme...@yahoo.com wrote:

 It is claimed that Spark is 10x or 100x faster than MapReduce and Hive,
 but since I started using it I haven't seen any faster performance. It is
 taking 2 minutes to run map and join tasks over just 2GB of data, whereas Hive
 was taking just a few seconds to join 2 tables over the same data. And I
 haven't gotten any answers to my questions. I don't understand the purpose
 of this group, and there is not enough documentation on Spark and its usage.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-is-slow-tp4539.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: [ann] Spark-NYC Meetup

2014-04-21 Thread Sam Bessalah
 Sounds great, François.
 On 21 Apr 2014 22:31, François Le Lay f...@spotify.com wrote:

 Hi everyone,

 This is a quick email to announce the creation of a Spark-NYC Meetup.
 We have 2 upcoming events, one at PlaceIQ, another at Spotify where
 Reynold Xin (Databricks) and Christopher Johnson (Spotify) have talks
 scheduled.

 More info : http://www.meetup.com/Spark-NYC/

 --
 François Le Lay
 Data Infra Chapter Lead
 Spotify NYC
 twitter : @lelayf