memory leak exception

2016-05-13 Thread Imran Akbar
I'm trying to save a table using this code in pyspark with 1.6.1: prices = sqlContext.sql("SELECT AVG(amount) AS mean_price, country FROM src GROUP BY country") prices.collect() prices.write.saveAsTable('prices', format='parquet', mode='overwrite', path='/mnt/bigdisk/tables') but I'm getting
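The GROUP BY average in the query above can be illustrated in plain Python. This is a hedged sketch of the aggregation semantics only, with made-up sample rows standing in for the `src` table; it is not how Spark executes the query:

```python
from collections import defaultdict

# Hypothetical rows standing in for the `src` table
rows = [
    {"country": "US", "amount": 10.0},
    {"country": "US", "amount": 30.0},
    {"country": "DE", "amount": 5.0},
]

# Equivalent of: SELECT AVG(amount) AS mean_price, country FROM src GROUP BY country
totals = defaultdict(lambda: [0.0, 0])  # country -> [sum, count]
for row in rows:
    acc = totals[row["country"]]
    acc[0] += row["amount"]
    acc[1] += 1

mean_price = {country: s / n for country, (s, n) in totals.items()}
print(mean_price)  # {'US': 20.0, 'DE': 5.0}
```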

Spark crashes with Filesystem recovery

2016-05-10 Thread Imran Akbar
I have some Python code that consistently ends up in this state: ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server Traceback (most recent call last): File "/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 690, in start

Re: slow SQL query with cached dataset

2016-04-28 Thread Imran Akbar
Mich Talebzadeh > > LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > http://talebzadehmich.wordpress.com > > On

Re: ordering over structs

2016-04-12 Thread Imran Akbar
...struct (which creates an actual struct), you are trying to use the struct datatype > (which just represents the schema of a struct). > On Thu, Apr 7, 2016 at 3:48 PM, Imran Akbar <skunkw...@gmail.com> wrote: >> thanks Michael, >> I'm tr
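Michael's distinction in the reply above (a `StructType` describes a schema; `struct(...)` builds an actual value) has a rough analogue in Python's classes versus instances. This is an analogy only, not Spark code; the class and field names are made up:

```python
from dataclasses import dataclass

# Analogy: the class plays the role of StructType (a description of fields),
# while constructing an instance plays the role of struct(...) (an actual value).
@dataclass
class RowStruct:
    dt: str
    product: str

value = RowStruct(dt="2016-04-07", product="widget")

print(type(value).__name__)  # RowStruct  (the "schema" side)
print(value.product)         # widget     (the "value" side)
```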

Re: ordering over structs

2016-04-07 Thread Imran Akbar
thanks Michael, I'm trying to implement the code in pyspark like so (where my dataframe has 3 columns - customer_id, dt, and product): st = StructType().add("dt", DateType(), True).add("product", StringType(), True) top = data.select("customer_id", st.alias('vs')) .groupBy("customer_id")

ordering over structs

2016-04-06 Thread Imran Akbar
I have a use case similar to this: http://stackoverflow.com/questions/33878370/spark-dataframe-select-the-first-row-of-each-group and I'm trying to understand the solution titled "ordering over structs": 1) Is a struct in Spark like a struct in C++? 2) What is an alias in this context? 3) How
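On question 1 in the post above: for ordering purposes a Spark struct behaves less like a C++ struct and more like a tuple, comparing field by field from left to right; the alias in question 2 just names the resulting struct column. The "ordering over structs" trick is therefore the same idea as taking a lexicographic max over tuples per group. A plain-Python sketch with hypothetical data:

```python
# first-row-of-each-group via lexicographic max: the same idea as
# groupBy(...).agg(max(struct(dt, product))) in Spark SQL
rows = [
    ("c1", "2016-01-01", "apples"),
    ("c1", "2016-03-01", "pears"),
    ("c2", "2016-02-01", "plums"),
]

latest = {}
for customer_id, dt, product in rows:
    # tuples compare field by field, so max keeps the latest dt per customer
    cur = latest.get(customer_id)
    if cur is None or (dt, product) > cur:
        latest[customer_id] = (dt, product)

print(latest)  # {'c1': ('2016-03-01', 'pears'), 'c2': ('2016-02-01', 'plums')}
```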

partitioned parquet tables

2016-04-01 Thread Imran Akbar
Hi, I'm reading in a CSV file, and I would like to write it back as a permanent table, but with partitioning by year, etc. Currently I do this: from pyspark.sql import HiveContext sqlContext = HiveContext(sc) df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',
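In pyspark the usual route for this is `df.write.partitionBy('year')` before saving. The Hive-style on-disk layout that partitioning produces (one `year=YYYY` subdirectory per distinct value) can be mimicked in plain Python; this sketch uses made-up rows and a scratch directory, and only illustrates the directory convention, not Spark's writer:

```python
import csv
import tempfile
from pathlib import Path

rows = [
    {"year": "2014", "amount": "10"},
    {"year": "2015", "amount": "20"},
    {"year": "2015", "amount": "30"},
]

base = Path(tempfile.mkdtemp()) / "prices"
for row in rows:
    # Hive-style partition directory: one subdir per distinct partition value
    part_dir = base / f"year={row['year']}"
    part_dir.mkdir(parents=True, exist_ok=True)
    with open(part_dir / "part-0.csv", "a", newline="") as f:
        csv.writer(f).writerow([row["amount"]])

print(sorted(p.name for p in base.iterdir()))  # ['year=2014', 'year=2015']
```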

writing partitioned parquet files

2016-04-01 Thread Imran Akbar
Hi, I'm reading in a CSV file, and I would like to write it back as a permanent table, but with partitioning by year, etc. Currently I do this: from pyspark.sql import HiveContext sqlContext = HiveContext(sc) df =

Re: Error reading a CSV

2016-02-24 Thread Imran Akbar
Thanks Suresh, that worked like a charm! I created the /user/hive/warehouse directory and chmod'd to 777. regards, imran On Wed, Feb 24, 2016 at 2:48 PM, Suresh Thalamati < suresh.thalam...@gmail.com> wrote: > Try creating /user/hive/warehouse/ directory if it does not exists , and > check it
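Suresh's fix amounts to creating the warehouse directory the HiveContext expects and opening up its permissions. A hedged sketch of the same steps in Python, using a scratch path; the real path in the thread is /user/hive/warehouse, which on a real cluster typically needs root or, when the warehouse lives on HDFS, `hdfs dfs -mkdir -p` and `hdfs dfs -chmod`:

```python
import os
import tempfile

# Scratch stand-in for /user/hive/warehouse (illustration only)
warehouse = os.path.join(tempfile.gettempdir(), "demo-user", "hive", "warehouse")

os.makedirs(warehouse, exist_ok=True)  # mkdir -p
os.chmod(warehouse, 0o777)             # chmod 777

print(os.path.isdir(warehouse))  # True
```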

naive bayes text classifier with tf-idf in pyspark

2015-02-06 Thread Imran Akbar
Hi, I've got the following code http://pastebin.com/3kexKwg6 that's almost complete, but I have 2 questions: 1) Once I've computed the TF-IDF vector, how do I compute the vector for each string to feed into the LabeledPoint? 2) Does MLLib provide any methods to evaluate the model's precision,
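On question 1 above: a TF-IDF vector is just term frequency scaled by inverse document frequency, and each new string must be transformed with the same vocabulary and IDF weights before it becomes the feature vector of a LabeledPoint. A plain-Python sketch of the computation with made-up documents; MLlib's HashingTF/IDF differ in detail (term hashing, exact IDF formula):

```python
import math
from collections import Counter

docs = [
    "spark makes big data simple",
    "spark sql and dataframes",
    "naive bayes is a simple classifier",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# document frequency: in how many docs each term appears
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    tf = Counter(doc)
    # smoothed idf variant; libraries differ on the exact formula
    return {t: tf[t] * math.log((n_docs + 1) / (df[t] + 1)) for t in tf}

vec = tfidf(tokenized[0])
# "spark" appears in 2 docs, "big" in only 1, so "big" is weighted higher
print(vec["big"] > vec["spark"])  # True
```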

installing spark 1 on hadoop 1

2014-07-02 Thread Imran Akbar
Hi, I'm trying to install spark 1 on my hadoop cluster running on EMR. I didn't have any problem installing the previous versions, but on this version I couldn't find any 'sbt' folder. However, the README still suggests using this to install Spark: ./sbt/sbt assembly which fails: ./sbt/sbt: