Re: High level explanation of dropDuplicates

2020-01-11 Thread Miguel Morales
I would just map to pairs using the id as the key. Then do a reduceByKey where you compare the scores and keep the highest. Then do .values and that should do it. Sent from my iPhone > On Jan 11, 2020, at 11:14 AM, Rishi Shah wrote: > >  > Thanks everyone for your contribution on this topic, I wanted to
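
A minimal sketch of that map / reduceByKey / .values chain; the Record(id, score) shape is hypothetical, not from the original thread:

    import org.apache.spark.rdd.RDD

    // Hypothetical record shape; only id and score matter for the dedup.
    case class Record(id: String, score: Double)

    def keepHighestScorePerId(records: RDD[Record]): RDD[Record] =
      records
        .map(r => (r.id, r))                                      // pair by id
        .reduceByKey((a, b) => if (a.score >= b.score) a else b)  // keep the higher-scoring row
        .values                                                   // drop the key again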

Re: HDFS or NFS as a cache?

2017-10-02 Thread Miguel Morales
See: https://github.com/rdblue/s3committer and https://www.youtube.com/watch?v=8F2Jqw5_OnI=youtu.be On Mon, Oct 2, 2017 at 11:31 AM, Marcelo Vanzin wrote: > You don't need to collect data in the driver to save it. The code in > the original question doesn't use

Re: Spark <--> S3 flakiness

2017-05-13 Thread Miguel Morales
rk + Alluxio. > > You mentioned that it required a lot of effort to get working. May I ask > what you ran into, and how you got it to work? > > Thanks, > Gene > > On Thu, May 11, 2017 at 11:55 AM, Miguel Morales <therevolti...@gmail.com> > wrote: >> >> Mi

Re: Spark <--> S3 flakiness

2017-05-11 Thread Miguel Morales
ets in: >> https://issues.apache.org/jira/browse/SPARK-10063 >> https://issues.apache.org/jira/browse/HADOOP-13786 >> https://issues.apache.org/jira/browse/HADOOP-9565 look relevant too. >> >> On 10 May 2017 at 22:24, Miguel Morales <therevolti...@gmail.com> wrote: >>

Re: Spark <--> S3 flakiness

2017-05-10 Thread Miguel Morales
Try using the DirectParquetOutputCommitter: http://dev.sortable.com/spark-directparquetoutputcommitter/ On Wed, May 10, 2017 at 10:07 PM, lucas.g...@gmail.com wrote: > Hi users, we have a bunch of pyspark jobs that are using S3 for loading / > intermediate steps and final
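
A hedged sketch of wiring that committer in for a Spark 1.x-era job; the class's package moved between 1.x releases and it was removed in Spark 2.0 (SPARK-10063), so treat this as illustrative only:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Spark 1.x-only sketch: adjust the committer's package for your release.
    val conf = new SparkConf()
      .setAppName("s3-parquet-writer")
      .set("spark.sql.parquet.output.committer.class",
           "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
      // Direct committers can't recover from re-run tasks, so disable speculation.
      .set("spark.speculation", "false")

    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Parquet writes via sqlContext (e.g. df.write.parquet("s3a://bucket/path"))
    // then commit directly to S3 instead of renaming from a temporary directory.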

Re: Etl with spark

2017-02-12 Thread Miguel Morales
You can parallelize the collection of s3 keys and then pass that to your map function so that files are read in parallel. Sent from my iPhone > On Feb 12, 2017, at 9:41 AM, Sam Elamin wrote: > > thanks Ayan but i was hoping to remove the dependency on a file and just
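 
A rough sketch of distributing the key list, assuming a hypothetical bucket, made-up keys, and the AWS SDK v1 client available on the executors:

    import scala.io.Source
    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import org.apache.spark.sql.SparkSession

    object S3KeyEtl {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("s3-etl").getOrCreate()
        val sc = spark.sparkContext

        val bucket = "my-bucket"                      // hypothetical bucket
        val keys   = Seq("events/2017/02/12/a.json",  // hypothetical keys
                         "events/2017/02/12/b.json")

        // Distribute the key list; each task downloads and parses its own files.
        val lines = sc.parallelize(keys, keys.size).mapPartitions { part =>
          val s3 = AmazonS3ClientBuilder.defaultClient() // built per partition; not serializable
          part.flatMap { key =>
            Source.fromInputStream(s3.getObject(bucket, key).getObjectContent).getLines()
          }
        }

        println(lines.count())
        spark.stop()
      }
    }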

Re: TDD in Spark

2017-01-15 Thread Miguel Morales
I've also written a small blog post that may help you out: https://medium.com/@therevoltingx/test-driven-development-w-apache-spark-746082b44941#.ia6stbl6n On Sun, Jan 15, 2017 at 12:13 PM, Silvio Fiorito wrote: > You should check out Holden’s excellent

Re: Error when loading json to spark

2016-12-31 Thread Miguel Morales
Looks like it's trying to treat that path as a folder; try omitting the file name and just using the folder path. On Sat, Dec 31, 2016 at 7:58 PM, Raymond Xie wrote: > Happy new year!!! > > I am trying to load a json file into spark, the json file is attached here. > > I
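
In other words, something along these lines (the path is made up):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("json-load").getOrCreate()
    // Point at the directory containing the JSON, not at the file inside it.
    val df = spark.read.json("hdfs:///data/incoming/")   // hypothetical folder path
    df.printSchema()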

Re: Dependency Injection and Microservice development with Spark

2016-12-28 Thread Miguel Morales
Hi. Not sure about Spring Boot, but trying to use DI libraries you'll run into serialization issues. I've had luck using an old version of Scaldi. Recently, though, I've been passing the class types as arguments with default values; then in the Spark code it gets instantiated. So you're
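
A small sketch of that pattern, with hypothetical MetricsSink / NoOpSink / LoggingSink names standing in for whatever would otherwise come from the DI container:

    import org.apache.spark.sql.SparkSession

    // Hypothetical dependency that would normally be injected.
    trait MetricsSink { def record(name: String, value: Long): Unit }
    class NoOpSink    extends MetricsSink { def record(name: String, value: Long): Unit = () }
    class LoggingSink extends MetricsSink { def record(name: String, value: Long): Unit = println(s"$name=$value") }

    // The job takes the concrete class as an argument with a default and
    // instantiates it itself, so no injector state ends up in Spark closures.
    class WordCountJob(sinkClass: Class[_ <: MetricsSink] = classOf[NoOpSink]) {
      def run(spark: SparkSession, path: String): Unit = {
        val counts = spark.sparkContext.textFile(path)
          .flatMap(_.split("\\s+"))
          .map((_, 1L))
          .reduceByKey(_ + _)

        val sink = sinkClass.getDeclaredConstructor().newInstance()
        sink.record("distinct_words", counts.count())
      }
    }

    // Usage: new WordCountJob(classOf[LoggingSink]).run(spark, "hdfs:///tmp/input.txt")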

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Miguel Morales
idea, thanks! > > But unfortunately that's not possible. All containers are connected to > an overlay network. > > Is there any other possibility to tell Spark that it is on the same *NODE* > as an HDFS data node? > > > On 28.12.2016 12:00, Miguel Morales wrote: >>

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Miguel Morales
It might have to do with your container IPs; it depends on your networking setup. You might want to try host networking so that the containers share the IP with the host. On Wed, Dec 28, 2016 at 1:46 AM, Karamba wrote: > > Hi Sun Rui, > > thanks for answering! > > >> Although

Re: unit testing in spark

2016-12-08 Thread Miguel Morales
ion - but would it maybe make > sense for those of us that all care about testing to try and do a hangout at > some point so that we can exchange ideas? > >> On Thu, Dec 8, 2016 at 4:15 PM, Miguel Morales <therevolti...@gmail.com> >> wrote: >> I would be interes

Re: unit testing in spark

2016-12-08 Thread Miguel Morales
I would be interested in contributing. I've created my own library for this as well. In my blog post I talk about testing with Spark in RSpec style: https://medium.com/@therevoltingx/test-driven-development-w-apache-spark-746082b44941 Sent from my iPhone > On Dec 8, 2016, at 4:09 PM, Holden

Re: Spark app write too many small parquet files

2016-12-08 Thread Miguel Morales
Try to coalesce with a value of 2 or so. You could dynamically calculate how many partitions to have to obtain an optimal file size. Sent from my iPhone > On Dec 8, 2016, at 1:03 PM, Kevin Tran wrote: > > How many partition should it be when streaming? - As in streaming
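
A rough sketch of deriving the partition count before the write; the 128 MB target and the ~1 KB-per-row estimate are assumptions to tune, not numbers from the thread:

    import org.apache.spark.sql.DataFrame

    def writeCompactParquet(df: DataFrame, path: String,
                            targetFileBytes: Long = 128L * 1024 * 1024): Unit = {
      // Cheap, rough size estimate: rows * an assumed average row width.
      val approxBytes = df.count() * 1024L
      val partitions  = math.max(1, (approxBytes / targetFileBytes).toInt)
      df.coalesce(partitions).write.mode("overwrite").parquet(path)
    }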

Re: Monitoring the User Metrics for a long running Spark Job

2016-12-05 Thread Miguel Morales
One thing I've done before is to install Datadog's statsd agent on the nodes. Then you can emit metrics and stats to it and build dashboards on Datadog. Sent from my iPhone > On Dec 5, 2016, at 8:17 PM, Chawla,Sumit wrote: > > Hi Manish > > I am specifically looking

Re: Spark Standalone Cluster - Running applications in JSON format

2016-11-30 Thread Miguel Morales
e log file for the history server indicates there was a > problem. > > I will keep digging around. Thanks for your help so far Miguel. > > On 1/12/2016 3:33 PM, Miguel Morales wrote: > > Try hitting: http://:18080/api/v1 > > Then hit /applications. > > That should g

Re: Spark Standalone Cluster - Running applications in JSON format

2016-11-30 Thread Miguel Morales
er - > http://:4040. > > I don't have a running driver Spark instance since I am submitting jobs to > Spark using the SparkLauncher class. Or maybe I am missing something obvious. > Apologies if so. > > > > > On 1/12/2016 3:21 PM, Miguel Morales wrote: > > Check

Re: Spark Standalone Cluster - Running applications in JSON format

2016-11-30 Thread Miguel Morales
Check the Monitoring and Instrumentation API: http://spark.apache.org/docs/latest/monitoring.html On Wed, Nov 30, 2016 at 9:20 PM, Carl Ballantyne wrote: > Hi All, > > I want to get the running applications for my Spark Standalone cluster in > JSON format. The same
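
For example, a minimal client against that REST API, assuming a history server on its default port 18080 (a live driver exposes the same endpoints on 4040):

    import scala.io.Source

    object ListApplications {
      def main(args: Array[String]): Unit = {
        val base = "http://localhost:18080/api/v1"   // adjust host/port for your cluster
        val json = Source.fromURL(s"$base/applications").mkString
        println(json)  // JSON array describing running and completed applications
      }
    }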

Re: updateStateByKey -- when the key is multi-column (like a composite key )

2016-11-30 Thread Miguel Morales
I *think* you can return a map to updateStateByKey which would include your fields. Another approach would be to create a hash (e.g., build a JSON version of the hash and return that). On Wed, Nov 30, 2016 at 12:30 PM, shyla deshpande wrote: > updateStateByKey - Can
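
A sketch of both ideas together, using a tuple as the composite key and a plain map as the state; Event and its fields are hypothetical:

    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical event; userId + deviceId form the composite key.
    case class Event(userId: String, deviceId: String, amount: Double)

    // A tuple works fine as the key, and the state is just a map of named fields.
    // (updateStateByKey also requires ssc.checkpoint(...) to be configured.)
    def runningTotals(events: DStream[Event]): DStream[((String, String), Map[String, Double])] =
      events
        .map(e => ((e.userId, e.deviceId), e.amount))
        .updateStateByKey[Map[String, Double]] { (amounts: Seq[Double], state: Option[Map[String, Double]]) =>
          val total = state.flatMap(_.get("total")).getOrElse(0.0) + amounts.sum
          val count = state.flatMap(_.get("count")).getOrElse(0.0) + amounts.size
          Some(Map("total" -> total, "count" -> count))
        }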