Re: Sample sql query using pyspark

2016-03-01 Thread James Barney
Maurin, I don't know the technical reason why, but try removing the 'limit 100' part of your query. I was trying to do something similar the other week, and what I found is that each executor doesn't necessarily get the same 100 rows. Joins would fail or result in a bunch of nulls when keys
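The nondeterminism described above can be sketched in plain Python (a simulation, not actual Spark code): if two tasks each materialize the "first 100" rows from independently ordered views of the same table, the key sets they see need not match, so a join between them silently loses rows.

```python
import random

rows = [(i, f"value-{i}") for i in range(1000)]

# Simulate two executors that each take "LIMIT 100" from their own,
# independently shuffled view of the same 1000-row table.
left = dict(random.Random(1).sample(rows, 100))
right = dict(random.Random(2).sample(rows, 100))

# An inner join on the key keeps only rows both sides happened to sample;
# in an outer join the misses would surface as nulls instead.
joined = {k: (left[k], right[k]) for k in left.keys() & right.keys()}

print(len(left), len(right), len(joined))
```

With 100-row samples from 1000 keys, the overlap is far smaller than 100, which is why dropping the `limit` (or applying it once, before the join) avoids the problem.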

Re: How could I do this algorithm in Spark?

2016-02-24 Thread James Barney
Guillermo, I think you're after an associative algorithm where A is ultimately associated with D, correct? Jakob would be correct if that is a typo--a sort would be all that is necessary in that case. I believe you're looking for something else though, if I understand correctly. This seems like a
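If the associative reading is right (A→B, B→C, C→D, so A ends up associated with D), that is a transitive-closure / connected-components problem. A minimal plain-Python sketch using union-find — one common way to compute such groupings, not necessarily Guillermo's exact requirement:

```python
def find(parent, x):
    # Path-halving find: follow parents to the set representative.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def group(pairs):
    # Union-find over the pairs; items linked through any chain of
    # pairs end up in the same group.
    parent = {}
    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb
    groups = {}
    for x in parent:
        groups.setdefault(find(parent, x), set()).add(x)
    return list(groups.values())

# The A-B, B-C, C-D chain collapses into one group containing A and D.
print(group([("A", "B"), ("B", "C"), ("C", "D"), ("X", "Y")]))
```

In Spark the same idea is usually expressed with GraphX/GraphFrames connected components rather than a single-machine union-find.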

pyspark.DataFrame.dropDuplicates

2016-02-12 Thread James Barney
Hi all, Just wondering what the actual logic governing DataFrame.dropDuplicates() is? For example: >>> from pyspark.sql import Row >>> df = sc.parallelize([ \ Row(name='Alice', age=5, height=80, itemsInPocket=['pen', 'pencil', 'paper']), \ Row(name='Alice', age=5, height=80, itemsInPocket=['pen',
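For the record, `DataFrame.dropDuplicates(subset=None)` keeps one row per distinct combination of all columns, or of just the columns named in `subset`. A plain-Python sketch of those semantics (the rows-as-dicts representation is illustrative, not Spark's internals):

```python
def drop_duplicates(rows, subset=None):
    # Mimic DataFrame.dropDuplicates(): keep the first row seen for each
    # distinct key, where the key is every column (or just `subset`).
    seen = set()
    out = []
    for row in rows:
        cols = subset if subset is not None else sorted(row)
        key = tuple((c, row.get(c)) for c in cols)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [
    {"name": "Alice", "age": 5, "height": 80},
    {"name": "Alice", "age": 5, "height": 80},   # exact duplicate: dropped
    {"name": "Alice", "age": 10, "height": 80},  # differs in age: kept
]
print(drop_duplicates(rows))                   # 2 rows survive
print(drop_duplicates(rows, subset=["name"]))  # 1 row survives
```

Which of several duplicate rows survives in Spark is not guaranteed, since rows arrive in whatever order the partitions produce them.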

Re: Extract all the values from describe

2016-02-08 Thread James Barney
Hi Arunkumar, From the Scala documentation it's recommended to use the agg function for performing any actual statistics programmatically on your data. df.describe() is meant only for data exploration. See Aggregator here:
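The programmatic route being recommended: rather than parsing df.describe()'s formatted output, compute the aggregates directly — in PySpark via `df.agg(...)` with functions such as `mean`, `min`, `max`, and `stddev` from `pyspark.sql.functions`, then read the values from the returned Row. A plain-Python sketch of extracting the same numbers (the Spark calls are omitted since they need a running session; the column name is illustrative):

```python
import statistics

rows = [{"age": 5}, {"age": 10}, {"age": 7}]
ages = [r["age"] for r in rows]

# Programmatic equivalents of the figures describe() prints for one
# column, collected as values rather than formatted strings.
stats = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "stddev": statistics.stdev(ages),
    "min": min(ages),
    "max": max(ages),
}
print(stats)
```

The point is that `agg` gives you typed values you can feed into further computation, whereas `describe()` produces a display-oriented DataFrame of strings.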