Re: Guys is this some form of Spam or someone has left his auto-reply loose LOL

2016-07-28 Thread Pedro Rodriguez
risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destr

Re: performance problem when reading lots of small files created by spark streaming.

2016-07-27 Thread Pedro Rodriguez
ying the files from s3 to a normal file system on one > of my workers and then concatenating the files into a few much large files, > then using ‘hadoop fs put’ to move them to hdfs. Do you think this would > improve the spark count() performance issue? > > Does anyone know of heuristic
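A minimal sketch of the compaction idea discussed in this thread, assuming a spark-shell style sqlContext and hypothetical paths: read the many small files once, then coalesce before writing so downstream jobs see a few large files rather than thousands of small ones.

    // Paths are placeholders; coalesce(16) is an illustrative target, not a rule.
    val df = sqlContext.read.parquet("s3a://bucket/streaming-output/")
    df.coalesce(16).write.parquet("hdfs:///compacted/")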

Re: dynamic coalesce to pick file size

2016-07-26 Thread Pedro Rodriguez
here have a way of > dynamically picking the number depending of the file size wanted? (around > 256mb would be perfect) > > > > I am running spark 1.6 on CDH using yarn, the files are written in parquet > format. > > > > Thanks > > > -- Pedro Rodriguez
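A hedged sketch of deriving the partition count from the input size with the Hadoop FileSystem API (sc, df, and paths assumed): Parquet output is compressed, so the 256 MB target is a rough heuristic rather than a guarantee.

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val inputBytes = fs.getContentSummary(new Path("/data/input")).getLength
    val targetBytes = 256L * 1024 * 1024                    // ~256 MB per file
    val numParts = math.max(1, (inputBytes / targetBytes).toInt)
    df.coalesce(numParts).write.parquet("/data/output")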

Re: dataframe.foreach VS dataframe.collect().foreach

2016-07-26 Thread Pedro Rodriguez
ml#org.apache.spark.sql.Row (1.6 docs, but should be similar in 2.0) On Tue, Jul 26, 2016 at 7:08 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote: > And Pedro has made sense of a world running amok, scared, and drunken > stupor. > > Regards, > Gourav > > On Tue, Jul 2

Re: dataframe.foreach VS dataframe.collect().foreach

2016-07-26 Thread Pedro Rodriguez
call collect spark *do nothing* so you df would not >> have any data -> can’t call foreach. >> Call collect execute the process -> get data -> foreach is ok. >> >> >> On Jul 26, 2016, at 2:30 PM, kevin <kiss.kevin...@gmail.com> wrote: >> >>
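For reference, the distinction being debated here, as a minimal sketch: both calls are actions, but the closure runs in different places.

    // Runs on the executors: println output lands in executor logs, not
    // the driver console, which can look like nothing happened.
    df.foreach(row => println(row))

    // Pulls every row to the driver first; only safe when the data fits
    // in driver memory.
    df.collect().foreach(row => println(row))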

Re: Spark SQL overwrite/append for partitioned tables

2016-07-25 Thread Pedro Rodriguez
o stay that way, but am not sure. If thats the case, is there a > good/robust way to get this behavior? > > -- > Pedro Rodriguez > PhD Student in Distributed Machine Learning | CU Boulder > UC Berkeley AMPLab Alumni > > ski.rodrig...@gmail.com | pedrorodriguez.io | 9

Spark SQL overwrite/append for partitioned tables

2016-07-25 Thread Pedro Rodriguez
of duplicated data 3. Preserve data for all other dates I am guessing that overwrite would not work here or if it does its not guaranteed to stay that way, but am not sure. If thats the case, is there a good/robust way to get this behavior? -- Pedro Rodriguez PhD Student in Distributed Machine
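A hedged sketch of the workaround commonly used at the time (paths and partition column hypothetical): write the new date's data directly into its partition directory, which leaves the other dates untouched; overwriting at the table root would instead drop them, and built-in dynamic partition overwrite only arrived in later Spark releases.

    // newDay holds only rows for date=2016-07-25.
    newDay.write
      .mode("overwrite")
      .parquet("s3a://bucket/events/date=2016-07-25")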

Re: Spark 2.0

2016-07-25 Thread Pedro Rodriguez
willing to go fix it myself). Should I just > create a ticket? > > Thank you, > > Bryan Jeffrey > > -- Pedro Rodriguez PhD Student in Distributed Machine Learning | CU Boulder UC Berkeley AMPLab Alumni ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423 Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience

Re: How to generate a sequential key in rdd across executors

2016-07-24 Thread Pedro Rodriguez
If you can use a dataframe then you could use rank + window function at the expense of an extra sort. Do you have an example of zip with index not working, that seems surprising. On Jul 23, 2016 10:24 PM, "Andrew Ehrlich" wrote: > It’s hard to do in a distributed system.
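Both routes, sketched with hypothetical rdd/df and column names; row_number() is the Spark 2.0 spelling, with older releases exposing rowNumber() instead.

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    // RDD route: contiguous indices across executors, no extra sort.
    val indexed = rdd.zipWithIndex()

    // DataFrame route: row_number over a window, at the cost of a sort
    // (and a single partition when no partitionBy is given).
    val w = Window.orderBy("timestamp")
    val numbered = df.withColumn("seq", row_number().over(w))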

Re: Choosing RDD/DataFrame/DataSet and Cluster Tuning

2016-07-23 Thread Pedro Rodriguez
each or 10 of 7 cores each. You can also kick up the memory to use more of your cluster’s memory. Lastly, if you are running on EC2 make sure to configure spark.local.dir to write to something that is not an EBS volume, preferably an attached SSD to something like an r3 machine. — Pedro Rodriguez
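The sizing advice restated as a hedged configuration sketch; exact numbers depend on the cluster, and on YARN the local directories are usually dictated by the node manager instead.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.cores", "5")          // fewer, fatter executors
      .set("spark.executor.memory", "20g")       // use more of each node's RAM
      .set("spark.local.dir", "/mnt/ssd/spark")  // instance-store SSD, not EBS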

Re: Error in collecting RDD as a Map - IOException in collectAsMap

2016-07-23 Thread Pedro Rodriguez
your setup that might affect networking. — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boulder Systems Oriented Data Scientist UC Berkeley AMPLab Alumni pedrorodriguez.io | 909-353-4423 github.com/EntilZha | LinkedIn On July 23, 2016 at 9:10:31 AM, VG (vlin...@gmail.com) wrote

Re: Error in collecting RDD as a Map - IOException in collectAsMap

2016-07-23 Thread Pedro Rodriguez
a security group which allows all traffic to/from itself to itself. If you are using something like ufw on ubuntu then you probably need to know the ip addresses of the worker nodes beforehand. — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boulder Systems Oriented Data Scientist

Re: Dataset , RDD zipWithIndex -- How to use as a map .

2016-07-22 Thread Pedro Rodriguez
; > >> In later part of the code I need to change a datastructure and update > name with index value generated above . > >> I am unable to figure out how to do a look up here.. > >> > >> Please suggest /. > >> > >> If there i

Re: How to get the number of partitions for a SparkDataFrame in Spark 2.0-preview?

2016-07-22 Thread Pedro Rodriguez
o use sparkR dapply on SparkDataFrame, don't we need partition the > DataFrame first? the example in doc doesn't seem to do this. > Without knowing how it partitioned, how can one write the function to > process each partition? > > Neil > > On Fri, Jul 22, 2016 at 5:56 PM, Pedro R

Re: spark and plot data

2016-07-22 Thread Pedro Rodriguez
been that we cannot > download the notebooks, cannot export them and certainly cannot sync them > back to Github, without mind numbing and sometimes irritating hacks. Have > those issues been resolved? > > > Regards, > Gourav > > > On Fri, Jul 22, 2016 at 2:22 PM,

Re: How to search on a Dataset / RDD <Row, Long >

2016-07-22 Thread Pedro Rodriguez
id the following >> JavaPairRDD<Row, Long> indexedBId = business.javaRDD() >> >> .zipWithIndex(); >> >> In later part of the code I need to change a datastructure and update >> name with index value generated above . >> I am unable to figure out how to do
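A sketch of using the generated index for lookups (RDD names hypothetical): key by the index, then join for a distributed lookup, or collect to a driver-side map only when it is small.

    val byIndex = rdd.zipWithIndex().map { case (row, idx) => (idx, row) }

    // Distributed lookup: join against another (index, value) RDD.
    val joined = byIndex.join(otherByIndex)

    // Driver-side lookup, only if it comfortably fits in memory.
    val lookup = byIndex.collectAsMap()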

Re: How to get the number of partitions for a SparkDataFrame in Spark 2.0-preview?

2016-07-22 Thread Pedro Rodriguez
This should work and I don't think triggers any actions: df.rdd.partitions.length On Fri, Jul 22, 2016 at 2:20 PM, Neil Chang <iam...@gmail.com> wrote: > Seems no function does this in Spark 2.0 preview? > -- Pedro Rodriguez PhD Student in Distributed Machine Learning | C

Re: spark and plot data

2016-07-22 Thread Pedro Rodriguez
are insufficient. — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boulder Systems Oriented Data Scientist UC Berkeley AMPLab Alumni pedrorodriguez.io | 909-353-4423 github.com/EntilZha | LinkedIn On July 22, 2016 at 3:04:48 AM, Marco Colombo (ing.marco.colo...@gmail.com) wrote: Take

Re: How can we control CPU and Memory per Spark job operation..

2016-07-22 Thread Pedro Rodriguez
new job where the cpu/memory ratio is more favorable which reads from the prior job’s output. I am guessing this heavily depends on how expensive reloading the data set from disk/network is.  Hopefully one of these helps, — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boul

Re: How can we control CPU and Memory per Spark job operation..

2016-07-16 Thread Pedro Rodriguez
You could call map on an RDD which has “many” partitions, then call repartition/coalesce to drastically reduce the number of partitions so that your second map job has less things running. — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boulder Systems Oriented Data Scientist
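The pattern as a minimal sketch; cheapTransform and heavyTransform are hypothetical stand-ins, and the partition counts are illustrative.

    // Wide stage: many partitions, high parallelism for the cheap work.
    val wide = rdd.repartition(1000).map(cheapTransform)

    // Narrow stage: shrink the partition count so far fewer of the
    // memory-hungry tasks run at once.
    val result = wide.coalesce(50).map(heavyTransform)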

Re: Saving data frames on Spark Master/Driver

2016-07-14 Thread Pedro Rodriguez
Out of curiosity, is there a way to pull all the data back to the driver to save without collect()? That is, stream the data in chunks back to the driver so that maximum memory used comparable to a single node’s data, but all the data is saved on one node. — Pedro Rodriguez PhD Student
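One answer to this question is RDD.toLocalIterator, sketched here with a hypothetical output path: it fetches one partition at a time, so peak driver memory stays near a single partition's size, at the cost of one job per partition.

    import java.io.PrintWriter

    val out = new PrintWriter("/tmp/all-data.txt")
    // Streams partitions to the driver one at a time rather than all at once.
    df.rdd.toLocalIterator.foreach(row => out.println(row))
    out.close()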

Re: Call http request from within Spark

2016-07-14 Thread Pedro Rodriguez
_key, api_secret)) > > # print json.dumps(data.json(), indent=4) > > return data > > > when I print the json dump of the data i see it returning results from the > rest call. But the count never stops. > > > Is there an efficient way of dealing this?
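A hedged sketch of making REST calls from inside a job (endpoint and ids RDD hypothetical): keep the HTTP work inside mapPartitions so per-partition setup happens once and the closure captures nothing non-serializable.

    import java.net.{HttpURLConnection, URL}
    import scala.io.Source

    val responses = ids.mapPartitions { iter =>
      iter.map { id =>
        val conn = new URL("https://api.example.com/item/" + id)
          .openConnection().asInstanceOf[HttpURLConnection]
        try Source.fromInputStream(conn.getInputStream).mkString
        finally conn.disconnect()
      }
    }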

Re: Tools for Balancing Partitions by Size

2016-07-13 Thread Pedro Rodriguez
g HIVE just let me know. > > > Regards, > Gourav Sengupta > > On Wed, Jul 13, 2016 at 5:39 AM, Pedro Rodriguez <ski.rodrig...@gmail.com> > wrote: > >> The primary goal for balancing partitions would be for the write to S3. >> We would like to prevent unbala

Re: Tools for Balancing Partitions by Size

2016-07-12 Thread Pedro Rodriguez
that to estimate the total size. Thanks for the idea. — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boulder Systems Oriented Data Scientist UC Berkeley AMPLab Alumni pedrorodriguez.io | 909-353-4423 github.com/EntilZha | LinkedIn On July 12, 2016 at 7:26:17 PM, Hatim Diab (timd

Tools for Balancing Partitions by Size

2016-07-12 Thread Pedro Rodriguez
also be useful to get programmatic access to the size of the RDD in memory if it is cached. Thanks, -- Pedro Rodriguez PhD Student in Distributed Machine Learning | CU Boulder UC Berkeley AMPLab Alumni ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423 Github: github.com/EntilZha
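A rough sketch of the sampling idea, using Spark's SizeEstimator (a DeveloperApi) and an illustrative 64 MB-per-partition target; in-memory estimates will not match serialized or on-disk sizes exactly.

    import org.apache.spark.util.SizeEstimator

    val sample = df.take(1000)
    val bytesPerRow = SizeEstimator.estimate(sample) / sample.length.toDouble
    val totalBytes = bytesPerRow * df.count()
    val numParts = math.max(1, (totalBytes / (64L * 1024 * 1024)).toInt)
    val balanced = df.repartition(numParts)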

Spark SQL: Merge Arrays/Sets

2016-07-11 Thread Pedro Rodriguez
ets('words)) -> list of distinct words Thanks, -- Pedro Rodriguez PhD Student in Distributed Machine Learning | CU Boulder UC Berkeley AMPLab Alumni ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423 Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience
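A sketch of getting merged-distinct behavior without a custom UDAF, with hypothetical columns user and words: explode the arrays, then re-aggregate with collect_set (available from Spark 1.6).

    import org.apache.spark.sql.functions.{collect_set, explode}
    import sqlContext.implicits._   // for the $"..." column syntax

    val merged = df
      .select($"user", explode($"words").as("word"))
      .groupBy($"user")
      .agg(collect_set($"word").as("distinct_words"))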

Re: question about UDAF

2016-07-11 Thread Pedro Rodriguez
: Row): Any = { > buffer(0) > } > } > > > I don't quit understand why I get empty result from my UDAF, I guess there > may be 2 reason: > 1. error initialization with "" in code of define initialize method > 2. the buffer didn't write successfully.
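For reference, a minimal working UDAF skeleton (a string concat, standing in for the one under discussion): the usual cause of empty results is update/merge not assigning back into buffer(0). It can be registered with sqlContext.udf.register("concat_all", new ConcatUDAF).

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    class ConcatUDAF extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil)
      def bufferSchema: StructType = StructType(StructField("acc", StringType) :: Nil)
      def dataType: DataType = StringType
      def deterministic: Boolean = true
      def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = ""
      // update and merge must both write back into the buffer.
      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0)) buffer(0) = buffer.getString(0) + input.getString(0)
      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
        buffer1(0) = buffer1.getString(0) + buffer2.getString(0)
      def evaluate(buffer: Row): Any = buffer.getString(0)
    }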

Re: DataFrame Min By Column

2016-07-09 Thread Pedro Rodriguez
Thanks Michael, That seems like the analog to sorting tuples. I am curious, is there a significant performance penalty to the UDAF versus that? Its certainly nicer and more compact code at least. — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boulder Systems Oriented Data

Re: problem making Zeppelin 0.6 work with Spark 1.6.1, throwing jackson.databind.JsonMappingException exception

2016-07-09 Thread Pedro Rodriguez
It would be helpful if you included relevant configuration files from each or if you are using the defaults, particularly any changes to class paths. I worked through Zeppelin to 0.6.0 at work and at home without any issue so hard to say more without having more details. — Pedro Rodriguez PhD

Re: DataFrame Min By Column

2016-07-09 Thread Pedro Rodriguez
input at runtime? Thanks, — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boulder Systems Oriented Data Scientist UC Berkeley AMPLab Alumni pedrorodriguez.io | 909-353-4423 github.com/EntilZha | LinkedIn On July 9, 2016 at 1:33:18 AM, Pedro Rodriguez (ski.rodrig...@gmail.com

Re: DataFrame Min By Column

2016-07-09 Thread Pedro Rodriguez
spark sql types are allowed? — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boulder Systems Oriented Data Scientist UC Berkeley AMPLab Alumni pedrorodriguez.io | 909-353-4423 github.com/EntilZha | LinkedIn On July 8, 2016 at 6:06:32 PM, Xinh Huynh (xinh.hu...@gmail.com) wrote

DataFrame Min By Column

2016-07-08 Thread Pedro Rodriguez
Is there a way to on a GroupedData (from groupBy in DataFrame) to have an aggregate that returns column A based on a min of column B? For example, I have a list of sites visited by a given user and I would like to find the event with the minimum time (first event) Thanks, -- Pedro Rodriguez PhD
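The "sorting tuples" analog mentioned in the replies, sketched with hypothetical columns: structs compare field by field, so min over (time, event) keeps the event belonging to the earliest time.

    import org.apache.spark.sql.functions.{min, struct}
    import sqlContext.implicits._   // for the $"..." column syntax

    val firstEvents = df
      .groupBy($"user")
      .agg(min(struct($"time", $"event")).as("first"))
      .select($"user", $"first.event")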

Re: Custom RDD: Report Size of Partition in Bytes to Spark

2016-07-04 Thread Pedro Rodriguez
/9e632f2a71fba2858df748ed43f0dbb5dae52a83/src/main/scala/io/entilzha/spark/s3/S3RDD.scala#L100-L105 Reflection code:  https://github.com/EntilZha/spark-s3/blob/9e632f2a71fba2858df748ed43f0dbb5dae52a83/src/main/scala/io/entilzha/spark/s3/PrivateMethodUtil.scala Thanks, — Pedro Rodriguez PhD Student in Large-Scale Machine

Custom RDD: Report Size of Partition in Bytes to Spark

2016-07-03 Thread Pedro Rodriguez
could find were some Hadoop metrics. Is there a way to simply report the number of bytes a partition read so Spark can put it on the UI? Thanks, — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boulder Systems Oriented Data Scientist UC Berkeley AMPLab Alumni pedrorodriguez.io

Re: Call Scala API from PySpark

2016-06-30 Thread Pedro Rodriguez
That was indeed the case, using UTF8Deserializer makes everything work correctly. Thanks for the tips! On Thu, Jun 30, 2016 at 3:32 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote: > Quick update, I was able to get most of the plumbing to work thanks to the > code Holden posted

Re: Call Scala API from PySpark

2016-06-30 Thread Pedro Rodriguez
://github.com/apache/spark/blob/v1.6.2/python/pyspark/rdd.py#L182: _pickle.UnpicklingError: A load persistent id instruction was encountered, but no persistent_load function was specified. On Thu, Jun 30, 2016 at 2:13 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote: > Thanks Jeff a

Re: Call Scala API from PySpark

2016-06-30 Thread Pedro Rodriguez
java. >> >> https://github.com/apache/spark/blob/v1.6.2/python/pyspark/context.py#L172 >> >> >> >> On Thu, Jun 30, 2016 at 9:53 AM, Pedro Rodriguez <ski.rodrig...@gmail.com >> > wrote: >> >>> Hi All, >>> >>> I have written a Scala

Call Scala API from PySpark

2016-06-30 Thread Pedro Rodriguez
t thing I would run into is converting the JVM RDD[String] back to a Python RDD, what is the easiest way to do this? Overall, is this a good approach to calling the same API in Scala and Python? -- Pedro Rodriguez PhD Student in Distributed Machine Learning | CU Boulder UC Berkeley AMPLab Al
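The Scala side of this pattern as a hedged sketch (names hypothetical): expose a plain object method over Java-friendly types so py4j can reach it through sc._jvm; per the resolution above, the returned JavaRDD of strings can then be wrapped on the Python side using UTF8Deserializer.

    package com.example

    import org.apache.spark.api.java.JavaRDD

    object ScalaApi {
      // JavaRDD in and out keeps the py4j boundary simple.
      def upperCase(input: JavaRDD[String]): JavaRDD[String] =
        input.rdd.map(_.toUpperCase).toJavaRDD()
    }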

Re: Dataset Select Function after Aggregate Error

2016-06-18 Thread Pedro Rodriguez
e").count.select('name.as[String], 'count.as [Long]).collect() Does that seem like a correct understanding of Datasets? On Sat, Jun 18, 2016 at 6:39 AM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote: > Looks like it was my own fault. I had spark 2.0 cloned/built, but had the &g

Re: Dataset Select Function after Aggregate Error

2016-06-18 Thread Pedro Rodriguez
t($"_1", $"count").show > > +---+-+ > > | _1|count| > > +---+-+ > > | 1|1| > > | 2|1| > > +---+-+ > > > > On Sat, Jun 18, 2016 at 3:09 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> > wrote: > >>

Re: Skew data

2016-06-17 Thread Pedro Rodriguez
ata. > > I read that if the data was skewed while joining it would take long time > to finish the job.(99 percent finished in seconds where 1 percent of task > taking minutes to hour). > > How to handle skewed data in spark. > > Thanks, > Selvam R > +91-97877-87724 >
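Not from the thread itself, but the standard remedy for a skewed join key is salting, sketched here with hypothetical DataFrames bigDf and smallDf: spread each hot key across SALT buckets on the large side and replicate the small side to match.

    import org.apache.spark.sql.functions._
    import sqlContext.implicits._   // for the $"..." column syntax

    val SALT = 10
    val salted = bigDf.withColumn("skey",
      concat($"key", lit("_"), floor(rand() * SALT).cast("string")))
    val replicated = smallDf
      .withColumn("salt", explode(array((0 until SALT).map(lit): _*)))
      .withColumn("skey", concat($"key", lit("_"), $"salt".cast("string")))
    val joined = salted.join(replicated, "skey")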

Re: Dataset Select Function after Aggregate Error

2016-06-17 Thread Pedro Rodriguez
gt; http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset >> >> It might be different in 2.0. >> >> Xinh >> >> On Fri, Jun 17, 2016 at 3:33 PM, Pedro Rodriguez <ski.rodrig...@gmail.com >> > wrote: >&

Dataset Select Function after Aggregate Error

2016-06-17 Thread Pedro Rodriguez
low is the equivalent Dataframe code which works as expected: df.groupBy("uid").count().select("uid") Thanks! -- Pedro Rodriguez PhD Student in Distributed Machine Learning | CU Boulder UC Berkeley AMPLab Alumni ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423 Github: github.

Undestanding Spark Rebalancing

2016-01-14 Thread Pedro Rodriguez
work just fine for me, but I can't seem to find out for sure if Spark does job re-scheduling/stealing. Thanks -- Pedro Rodriguez PhD Student in Distributed Machine Learning | CU Boulder UC Berkeley AMPLab Alumni ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423 Github: github.com

Re: How to speed up MLlib LDA?

2015-09-22 Thread Pedro Rodriguez
parallelize(Seq(input))) // fast > model.topicDistributions(samples.zipWithIndex.map(_.swap)) // <== this > seems to take about 4 seconds to execute > > > marko > -- Pedro Rodriguez PhD Student in Distributed Machine Learning | CU Boulder UC Berkeley AMPLab Alumni ski.rodrig.

Re: Re: How can I know currently supported functions in Spark SQL

2015-08-06 Thread Pedro Rodriguez
using Spark 1.4.1, and I want to know how can I find the complete function list supported in Spark SQL, currently I only know 'sum','count','min','max'. Thanks a lot. -- Pedro Rodriguez PhD Student in Distributed Machine Learning | CU Boulder UC Berkeley AMPLab Alumni ski.rodrig
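A hedged sketch of listing what is available: in recent versions SHOW FUNCTIONS works natively (with 1.4.1 it may require a HiveContext), and the scaladoc for org.apache.spark.sql.functions is the practical reference either way.

    sqlContext.sql("SHOW FUNCTIONS").collect().foreach(println)
    sqlContext.sql("DESCRIBE FUNCTION upper").collect().foreach(println)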

Re: Spark Interview Questions

2015-07-29 Thread Pedro Rodriguez
-berkeleyx-cs100-1x https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x -- Pedro Rodriguez PhD Student in Distributed Machine Learning | CU Boulder UC Berkeley AMPLab Alumni ski.rodrig...@gmail.com | pedrorodriguez.io | 208-340-1703 Github: github.com/EntilZha | LinkedIn: https

Re: Spark SQL Table Caching

2015-07-22 Thread Pedro Rodriguez
.saveAsTable? I am only seeing a 10 %- 20% performance increase. -- Pedro Rodriguez UCBerkeley 2014 | Computer Science SnowGeek http://SnowGeek.org pedro-rodriguez.com ski.rodrig...@gmail.com 208-340-1703
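A minimal caching sketch for comparison, with a hypothetical table name; cacheTable is lazy, so run a query afterwards to actually materialize the cache before measuring.

    sqlContext.cacheTable("events")
    sqlContext.sql("SELECT COUNT(*) FROM events").collect()  // materializes the cache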

Re: Check for null in PySpark DataFrame

2015-07-02 Thread Pedro Rodriguez
-- Pedro Rodriguez UCBerkeley 2014 | Computer Science