Reading kafka stream and writing to hdfs

2015-09-28 Thread Chengi Liu
Hi, I am going through this example: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/kafka_wordcount.py If I want to write this data to HDFS, what's the right way to do it? Thanks
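
A minimal sketch of one answer, assuming the counts DStream from the linked kafka_wordcount.py example and a placeholder HDFS path; each batch interval is written out as a new prefix-TIMESTAMP directory:

    # counts is the DStream built in kafka_wordcount.py; the path prefix is a placeholder.
    counts.saveAsTextFiles("hdfs:///user/me/wordcounts/batch")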

Re: parquet error

2015-09-18 Thread Chengi Liu
of jets3t jar from hadoop/lib to spark/lib (which I believe is around version 0.9)? On Wed, Sep 16, 2015 at 4:59 PM, Chengi Liu <chengi.liu...@gmail.com> wrote: > Hi, > I have a spark cluster setup and I am trying to write the data to s3 but > in parquet format. > Here

parquet error

2015-09-16 Thread Chengi Liu
Hi, I have a spark cluster setup and I am trying to write the data to s3 but in parquet format. Here is what I am doing df = sqlContext.load('test', 'com.databricks.spark.avro') df.saveAsParquetFile("s3n://test") But I get some nasty error: Py4JJavaError: An error occurred while calling
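
For reference, a sketch of the intended flow under the setup described here (a Spark 1.3-style API, spark-avro on the classpath, S3 credentials already configured in the Hadoop configuration); the bucket and paths are placeholders, and the thread itself turned out to hinge on the jets3t jar version:

    # Load avro data and write it back out to S3 as parquet (paths are placeholders).
    df = sqlContext.load("/data/test.avro", "com.databricks.spark.avro")
    df.saveAsParquetFile("s3n://my-bucket/output/test.parquet")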

Re: Installing a python library along with ec2 cluster

2015-02-08 Thread Chengi Liu
of ec2 with all the python libraries installed and create a bash script to export PYTHONPATH in the /etc/init.d/ directory. Then you can launch the cluster with this image and ec2.py. Hope this can be helpful. Cheers Gen On Sun, Feb 8, 2015 at 9:46 AM, Chengi Liu chengi.liu...@gmail.com wrote

How to execute a custom python library on spark

2014-11-25 Thread Chengi Liu
Hi, I have written a few data structures as classes, like the following. So, here is my code structure: project/foo/foo.py, __init__.py; /bar/bar.py, __init__.py (bar.py imports foo as: from foo.foo import *); /execute/execute.py (imports bar as: from bar.bar import *). Ultimately I am
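
A sketch of one common way to make such a package importable on the executors, assuming the project directory has been zipped up first (the zip name is a placeholder); the same zip can also be passed to spark-submit via --py-files:

    # execute.py: ship the packaged project to the cluster, then import from it.
    from pyspark import SparkConf, SparkContext

    sc = SparkContext(conf=SparkConf().setAppName("execute"))
    sc.addPyFile("project.zip")   # contains foo/ and bar/ with their __init__.py files
    from bar.bar import *         # resolvable on the driver and inside tasks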

Fwd: sampling in spark

2014-10-29 Thread Chengi Liu
-- Forwarded message -- From: Chengi Liu chengi.liu...@gmail.com Date: Tue, Oct 28, 2014 at 11:23 PM Subject: Re: sampling in spark To: Davies Liu dav...@databricks.com Any suggestions.. Thanks On Tue, Oct 28, 2014 at 12:53 AM, Chengi Liu chengi.liu...@gmail.com wrote

Re: sampling in spark

2014-10-28 Thread Chengi Liu
Oops, the reference for the above code: http://stackoverflow.com/questions/26583462/selecting-corresponding-k-rows-from-matrix-and-vector/26583945#26583945 On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu chengi.liu...@gmail.com wrote: Hi, I have three rdds.. X,y and p X is matrix rdd (mXn), y

sampling in spark

2014-10-28 Thread Chengi Liu
Hi, I have three RDDs: X, y and p. X is an m x n matrix RDD, y is an m x 1 vector and p is an m x 1 probability vector. Now, I am trying to sample k rows from X and the corresponding entries in y based on the probability vector p. Here is the python implementation: import random; from bisect
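
A sketch of one way to do this in PySpark, assuming the probability vector is small enough to collect to the driver, that X, y and p are index-aligned, and that the Spark version has RDD.zipWithIndex (k is the desired sample size):

    import numpy as np

    # Pick k row indices on the driver, weighted by p, then filter X and y by index.
    probs = np.array(p.collect(), dtype=float)
    probs = probs / probs.sum()
    chosen = sc.broadcast(set(np.random.choice(len(probs), size=k, replace=False, p=probs)))

    X_sample = X.zipWithIndex().filter(lambda t: t[1] in chosen.value).map(lambda t: t[0])
    y_sample = y.zipWithIndex().filter(lambda t: t[1] in chosen.value).map(lambda t: t[0])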

Re: sampling in spark

2014-10-28 Thread Chengi Liu
) if i in index] On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu chengi.liu...@gmail.com wrote: Oops, the reference for the above code: http://stackoverflow.com/questions/26583462/selecting-corresponding-k-rows-from-matrix-and-vector/26583945#26583945 On Tue, Oct 28, 2014 at 12:26 AM, Chengi

Re: setting heap space

2014-10-13 Thread Chengi Liu
, reduceBy, join, sortBy etc. - If you don't have enough memory and the data is huge, then set the storage level to MEMORY_AND_DISK_SER. You can read more here: http://spark.apache.org/docs/1.0.0/tuning.html Thanks Best Regards On Sun, Oct 12, 2014 at 10:28 PM, Chengi Liu chengi.liu
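
A minimal sketch of the storage-level suggestion above, written against the Spark 1.x PySpark API (rdd stands for whatever is being cached):

    from pyspark import StorageLevel

    # Serialize cached partitions and spill to disk when they do not fit in memory.
    rdd = rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)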

Re: setting heap space

2014-10-13 Thread Chengi Liu
/bd591dbe9e2836d9a72b87c3e63e30ffd908dfd6/Benchmark.scala#L30 Thanks Best Regards On Mon, Oct 13, 2014 at 12:36 PM, Chengi Liu chengi.liu...@gmail.com wrote: Hi Akhil, Thanks for the response.. Another query... do you know how to use spark.executor.extraJavaOptions option? SparkConf.set(spark.executor.extraJavaOptions
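
A sketch of setting that option from PySpark; the JVM flags shown are only illustrative:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.executor.memory", "10g")
            .set("spark.executor.extraJavaOptions",
                 "-XX:+PrintGCDetails -XX:+UseCompressedOops"))
    sc = SparkContext(conf=conf)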

some more heap space error

2014-10-13 Thread Chengi Liu
Hi, I posted a query yesterday and have tried out all the options given in the responses. Basically, I am reading a very fat matrix (2000 by 50) and am trying to run kmeans on it. I keep getting a heap space error. Now, I am even using the persist(StorageLevel.DISK_ONLY_2) option..

setting heap space

2014-10-12 Thread Chengi Liu
Hi, I am trying to use spark but I am having a hard time configuring the sparkconf... My current conf is conf = SparkConf().set("spark.executor.memory", "10g").set("spark.akka.frameSize", "1").set("spark.driver.memory", "16g") but I still see the java heap size error 14/10/12 09:54:50 ERROR Executor:

Re: Broadcast error

2014-09-16 Thread Chengi Liu
:35 AM, Chengi Liu chengi.liu...@gmail.com wrote: So.. same result with parallelize (matrix,1000) with broadcast.. seems like I got jvm core dump :-/ 4/09/15 02:31:22 INFO BlockManagerInfo: Registering block manager host:47978 with 19.2 GB RAM 14/09/15 02:31:22 INFO BlockManagerInfo

Re: Broadcast error

2014-09-15 Thread Chengi Liu
)] return sqrt(sum([x**2 for x in (point - center)])) WSSSE = rdd.map(lambda point: error(point)).reduce(lambda x, y: x + y) print "Within Set Sum of Squared Error = " + str(WSSSE) Thanks Best Regards On Mon, Sep 15, 2014 at 9:16 AM, Chengi Liu chengi.liu...@gmail.com wrote: And the thing

Re: Broadcast error

2014-09-15 Thread Chengi Liu
at 2:18 PM, Chengi Liu chengi.liu...@gmail.com wrote: Hi Akhil, So with your config (specifically with set("spark.akka.frameSize", 1000)), I see the error: org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 0:0 was 401970046 bytes which exceeds

Broadcast error

2014-09-14 Thread Chengi Liu
Hi, I am trying to create an rdd out of a large matrix. sc.parallelize suggests using broadcast, but when I do sc.broadcast(data) I get this error: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/common/usg/spark/1.0.2/python/pyspark/context.py", line 370, in
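
One common workaround, sketched here as an assumption rather than what the thread finally settled on: write the large matrix to shared storage once and load it as an RDD, instead of serializing it from the driver with parallelize or broadcast. Paths are placeholders and must be readable from every worker.

    import numpy as np

    # data is the matrix from the question: write it out once,
    # then let the executors read it in parallel.
    np.savetxt("/global/scratch/matrix.csv", data, delimiter=",")
    rdd = (sc.textFile("/global/scratch/matrix.csv")
             .map(lambda line: [float(x) for x in line.split(",")]))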

Re: Broadcast error

2014-09-14 Thread Chengi Liu
values. On Sun, Sep 14, 2014 at 2:20 AM, Chengi Liu chengi.liu...@gmail.com wrote: Hi, I am trying to create an rdd out of a large matrix. sc.parallelize suggests using broadcast, but when I do sc.broadcast(data) I get this error: Traceback (most recent call last): File "<stdin>", line

Re: Broadcast error

2014-09-14 Thread Chengi Liu
, Sep 14, 2014 at 2:54 PM, Chengi Liu chengi.liu...@gmail.com wrote: Specifically, the error I see when I try to operate on the rdd created by the sc.parallelize method: org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 12:12 was 12062263 bytes which exceeds

Re: Broadcast error

2014-09-14 Thread Chengi Liu
)).reduce(lambda x, y: x + y) print "Within Set Sum of Squared Error = " + str(WSSSE) Which is executed as follows: spark-submit --master $SPARKURL clustering_example.py --executor-memory 32G --driver-memory 60G On Sun, Sep 14, 2014 at 10:47 AM, Chengi Liu chengi.liu...@gmail.com wrote: How

Re: Broadcast error

2014-09-14 Thread Chengi Liu
dav...@databricks.com wrote: Hey Chengi, What's the version of Spark you are using? It has big improvements to broadcast in 1.1, could you try it? On Sun, Sep 14, 2014 at 8:29 PM, Chengi Liu chengi.liu...@gmail.com wrote: Any suggestions.. I am really blocked on this one On Sun, Sep

Re: Broadcast error

2014-09-14 Thread Chengi Liu
And the thing is, the code runs just fine if I reduce the number of rows in my data? On Sun, Sep 14, 2014 at 8:45 PM, Chengi Liu chengi.liu...@gmail.com wrote: I am using Spark 1.0.2. This is my work cluster, so I can't set up a new version readily... But right now, I am not using broadcast

Submitting multiple files pyspark

2014-08-27 Thread Chengi Liu
Hi, I have two files: main_app.py and helper.py. main_app.py calls some functions in helper.py. I want to use spark-submit to submit a job, but how do I specify helper.py? Basically, how do I specify multiple files in spark? Thanks
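
A sketch of two common ways to ship helper.py along with main_app.py, one at submit time and one from inside the driver:

    # Option 1, at submit time:
    #   spark-submit --py-files helper.py main_app.py
    #
    # Option 2, inside main_app.py:
    from pyspark import SparkContext

    sc = SparkContext(appName="main_app")
    sc.addPyFile("helper.py")   # ships helper.py so the executors can import it
    import helper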

Re: Question on mappartitionwithsplit

2014-08-17 Thread Chengi Liu
(index, it, foo)) Also, you can make f a closure: def f2(index, iterator): yield (index, foo) rdd.mapPartitionsWithIndex(f2) On Sun, Aug 17, 2014 at 10:25 AM, Chengi Liu chengi.liu...@gmail.com wrote: Hi, In this example: http://www.cs.berkeley.edu/~pwendell/strataconf/api
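
A self-contained sketch of the same idea; the data here is just illustrative:

    # Tag every element with the index of the partition it lives in.
    rdd = sc.parallelize(range(12), 3)

    def f2(index, iterator):
        for x in iterator:
            yield (index, x)

    print rdd.mapPartitionsWithIndex(f2).collect()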

Re: iterating with index in psypark

2014-08-16 Thread Chengi Liu
nevermind folks!!! On Sat, Aug 16, 2014 at 2:22 PM, Chengi Liu chengi.liu...@gmail.com wrote: Hi, I have data like the following: 1,2,3,4 1,2,3,4 5,6,2,1 and so on.. I would like to create a new rdd as follows: (0,0,1) (0,1,2) (0,2,3) (0,3,4) (1,0,1) .. and so on.. How do I do
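
For reference, a sketch of one way to get there, assuming lines is the RDD of comma-separated rows and the Spark version has RDD.zipWithIndex:

    # Turn each "1,2,3,4" line into (row, column, value) triples.
    triples = (lines.zipWithIndex()
                    .flatMap(lambda t: [(t[1], j, int(v))
                                        for j, v in enumerate(t[0].split(","))]))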

java.lang.StackOverflowError

2014-08-05 Thread Chengi Liu
Hi, I am doing some basic preprocessing in pyspark (local mode) as follows: files = [ input files] def read(filename,sc): #process file return rdd if __name__ == "__main__": conf = SparkConf() conf.setMaster('local') sc = SparkContext(conf=conf) sc.setCheckpointDir(root + "temp/")
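
This error usually comes from a very long lineage. A sketch of the common fix, assuming the (truncated) code goes on to union the per-file RDDs in a loop: checkpoint every so often so the DAG stays shallow.

    # read() and files are from the question; the checkpoint interval is arbitrary.
    rdd = read(files[0], sc)
    for i, f in enumerate(files[1:], start=1):
        rdd = rdd.union(read(f, sc))
        if i % 10 == 0:
            rdd.cache()
            rdd.checkpoint()
            rdd.count()   # an action forces the checkpoint to materialize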

Re: java.lang.StackOverflowError

2014-08-05 Thread Chengi Liu
Bump On Tuesday, August 5, 2014, Chengi Liu chengi.liu...@gmail.com wrote: Hi, I am doing some basic preprocessing in pyspark (local mode) as follows: files = [ input files] def read(filename,sc): #process file return rdd if __name__ == "__main__": conf = SparkConf

Converting matrix format

2014-07-30 Thread Chengi Liu
Hi, I have an rdd with n rows and m columns, but most of the entries are 0; it's a sparse matrix. I would like to get only the non-zero entries along with their indices. The equivalent python code would be: for i,x in enumerate(matrix): for j,y in enumerate(x): if y: print i,j,y
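
A sketch of the equivalent in PySpark, assuming rdd holds the rows of the matrix as sequences of numbers and the Spark version has RDD.zipWithIndex:

    # Emit (row, column, value) only for the non-zero entries.
    nonzeros = (rdd.zipWithIndex()
                   .flatMap(lambda t: [(t[1], j, y)
                                       for j, y in enumerate(t[0]) if y]))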

Reading and processing binary format files using spark

2014-05-02 Thread Chengi Liu
Hi, Let's say I have millions of binary format files, and let's say I have a java (or python) library which reads and parses these binary formatted files. Say import foo f = foo.open(filename) header = f.get_header() and some other methods.. What I was thinking was to write a hadoop input
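
A sketch of the simplest form of that idea: distribute the file names and let each task call the parser. foo is the hypothetical library from the question, filenames is the list of paths, and every path must be readable from every worker.

    import foo   # the hypothetical binary-format parser from the question

    def parse(filename):
        f = foo.open(filename)
        return (filename, f.get_header())

    headers = sc.parallelize(filenames, 1000).map(parse)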

skip lines in spark

2014-04-23 Thread Chengi Liu
Hi, What is the easiest way to skip the first n lines in an rdd? I am not able to figure this one out. Thanks
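
A sketch of one way to do it, assuming the Spark version has RDD.zipWithIndex (which preserves the line order of a textFile RDD):

    # Drop the first n elements of the RDD.
    def skip_first_n(rdd, n):
        return (rdd.zipWithIndex()
                   .filter(lambda t: t[1] >= n)
                   .map(lambda t: t[0]))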

distinct in data frame in spark

2014-03-24 Thread Chengi Liu
Hi, I have a very simple use case: I have an rdd as follows: d = [[1,2,3,4],[1,5,2,3],[2,3,4,5]] Now, I want to remove all the duplicates from a column and return the remaining frame.. For example: If I want to remove the duplicates based on column 1, then basically I would remove either row
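
A sketch with plain RDD operations, keying on the column to de-duplicate (column 1 here) and keeping an arbitrary row per key:

    d = [[1, 2, 3, 4], [1, 5, 2, 3], [2, 3, 4, 5]]
    rdd = sc.parallelize(d)

    deduped = (rdd.map(lambda row: (row[0], row))   # key by the first column
                  .reduceByKey(lambda a, b: a)      # keep one row per key
                  .values())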

Re: sbt assembly fails

2014-03-18 Thread Chengi Liu
are not where they are intended to be, you're just seeing it fail through all of them. I think it remains a connectivity problem from your env to the repos, possibly because of a proxy? -- Sean Owen | Director, Data Science | London On Mon, Mar 17, 2014 at 8:39 PM, Chengi Liu chengi.liu...@gmail.com

Re: sbt assembly fails

2014-03-18 Thread Chengi Liu
it out by building in command line.. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Tue, Mar 18, 2014 at 2:15 AM, Chengi Liu chengi.liu...@gmail.com wrote: Hi Sean, Yeah.. I am seeing errors across all repos and yepp

Re: Running spark examples

2014-03-17 Thread Chengi Liu
/index.html, there's a script to do it. On Mar 17, 2014, at 9:55 AM, Chengi Liu chengi.liu...@gmail.com wrote: Hi, I compiled the spark examples and I see that there are couple of jars spark-examples_2.10-0.9.0-incubating-sources.jar spark-examples_2.10-0.9.0-incubating.jar If I want

Re: sbt assembly fails

2014-03-17 Thread Chengi Liu
. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Mon, Mar 17, 2014 at 1:25 PM, Chengi Liu chengi.liu...@gmail.com wrote: Hi, I am trying to compile the spark project using sbt/sbt assembly.. And I see this error: [info

How to run a jar against spark

2014-03-14 Thread Chengi Liu
Hi, A very noob question.. Here is my code in eclipse: import org.apache.spark.SparkContext; import org.apache.spark.SparkContext._; object HelloWorld { def main(args: Array[String]) { println("Hello, world!") val sc = new SparkContext("localhost", "wordcount", args(0), Seq(args(1)))

Re: Dealing with headers in csv file pyspark

2014-02-26 Thread Chengi Liu
rather than using the pyspark shell.. On Wed, Feb 26, 2014 at 9:34 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: Bad solution is to run a mapper through the data and null the counts; good solution is to trim the header beforehand, without Spark. On Feb 26, 2014 9:28 AM, Chengi Liu chengi.liu
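
If dropping the header inside Spark is acceptable after all, a sketch of doing it per partition; the file name is a placeholder and the header is assumed to be the first line of the first partition:

    import itertools

    def drop_header(index, iterator):
        return itertools.islice(iterator, 1, None) if index == 0 else iterator

    data = sc.textFile("data.csv").mapPartitionsWithIndex(drop_header)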