create table in hive from spark-sql

2015-09-23 Thread Mohit Singh
Probably a noob question, but I am trying to create a Hive table using spark-sql. Here is what I am trying to do:

    hc = HiveContext(sc)
    hdf = hc.parquetFile(output_path)
    data_types = hdf.dtypes
    schema = "(" + " ,".join(map(lambda x: x[0] + " " + x[1], data_types)) + ")"
    hc.sql(" CREATE TABLE IF
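
A minimal sketch of where this snippet appears to be headed, assuming the Spark 1.x HiveContext API; the table name and the NOT EXISTS completion are placeholders, not recovered from the truncated message:

    from pyspark.sql import HiveContext

    hc = HiveContext(sc)
    hdf = hc.parquetFile(output_path)
    # dtypes is a list of (column_name, type_string) pairs, e.g. [('id', 'bigint')]
    schema = "(" + ", ".join(name + " " + dtype for name, dtype in hdf.dtypes) + ")"
    # register the parquet schema as a Hive table
    hc.sql("CREATE TABLE IF NOT EXISTS my_table " + schema)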

Re: Spark installation

2015-02-10 Thread Mohit Singh
For a local machine, I don't think there is anything to install.. Just unzip and go to $SPARK_DIR/bin/spark-shell and that will open up a REPL... On Tue, Feb 10, 2015 at 3:25 PM, King sami kgsam...@gmail.com wrote: Hi, I'm new to Spark. I want to install it on my local machine (Ubuntu 12.04). Could

Re: ImportError: No module named pyspark, when running pi.py

2015-02-09 Thread Mohit Singh
I think you have to run it as $SPARK_HOME/bin/pyspark /path/to/pi.py instead of plain python pi.py. On Mon, Feb 9, 2015 at 11:22 PM, Ashish Kumar ashish.ku...@innovaccer.com wrote: Command: sudo python ./examples/src/main/python/pi.py Error: Traceback (most recent call last):
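
The underlying issue is that the bin/pyspark launcher puts the pyspark sources on PYTHONPATH before starting Python, while a plain python invocation does not. A quick diagnostic sketch (illustrative, not from the thread):

    import os, sys

    # under plain "python", SPARK_HOME is typically unset and nothing on
    # sys.path contains the pyspark package, hence the ImportError
    print(os.environ.get('SPARK_HOME'))
    print([p for p in sys.path if 'spark' in p.lower()])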

is there a master for spark cluster in ec2

2015-01-28 Thread Mohit Singh
Hi, Probably a naive question, but I am creating a spark cluster on ec2 using the ec2 scripts in there.. But is there a master param I need to set: ./bin/pyspark --master [ ] ?? I don't yet fully understand the ec2 concepts, so I just wanted to confirm this?? Thanks -- Mohit When you want
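
For what it's worth, the spark-ec2 script launches a standalone cluster and prints the master's hostname when it finishes; a hedged sketch of pointing the driver at it (the hostname here is a placeholder):

    from pyspark import SparkContext

    # standalone masters listen on port 7077 by default
    sc = SparkContext("spark://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:7077", "my-app")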

Using third party libraries in pyspark

2015-01-22 Thread Mohit Singh
Hi, I might be asking something very trivial, but what's the recommended way of using third party libraries? I am using tables (PyTables) to read hdf5 format files.. And here is the error trace: print rdd.take(2) File /tmp/spark/python/pyspark/rdd.py, line , in take res =
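
Two common routes, sketched under assumptions (paths are placeholders): pure-Python dependencies can be shipped to the executors with addPyFile or --py-files, but C-extension libraries like PyTables have to be installed on every worker node, since only Python source/zip files get shipped:

    # ship pure-Python helper code to the executors
    sc.addPyFile("/path/to/helpers.zip")

    # equivalently at launch time:
    #   bin/pyspark --py-files /path/to/helpers.zip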

Re: How to create Track per vehicle using spark RDD

2014-10-14 Thread Mohit Singh
Perhaps it's just me, but the lag function isn't familiar to me.. But have you tried configuring Spark appropriately? http://spark.apache.org/docs/latest/configuration.html On Tue, Oct 14, 2014 at 5:37 PM, Manas Kar manasdebashis...@gmail.com wrote: Hi, I have an RDD containing Vehicle Number

Setting up jvm in pyspark from shell

2014-09-10 Thread Mohit Singh
Hi, I am using the pyspark shell and am trying to create an RDD from a numpy matrix: rdd = sc.parallelize(matrix) I am getting the following error: JVMDUMP039I Processing dump event systhrow, detail java/lang/OutOfMemoryError at 2014/09/10 22:41:44 - please wait. JVMDUMP032I JVM requested Heap dump
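
A hedged workaround sketch (sizes and counts are placeholders): raise the driver heap, and hand parallelize the matrix in smaller row chunks with an explicit slice count rather than as one huge object:

    import numpy as np

    # launch with a larger driver heap, e.g. bin/pyspark --driver-memory 4g
    matrix = np.random.rand(1000000, 100)             # placeholder data
    chunks = np.array_split(matrix, 1000)             # 1000 smaller pieces
    rdd = sc.parallelize(chunks, 1000).flatMap(list)  # one matrix row per element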

Personalized Page rank in graphx

2014-08-20 Thread Mohit Singh
Hi, I was wondering if the Personalized PageRank algorithm is implemented in graphx. If the talks and presentations are to be believed ( https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx@strata2014_final.pdf) it is.. but I can't find the algo code (

Re: Question on mappartitionwithsplit

2014-08-17 Thread Mohit Singh
Building on what Davies Liu said, how about something like:

    def indexing(splitIndex, iterator, offset_lists):
        count = 0
        offset = sum(offset_lists[:splitIndex]) if splitIndex else 0
        indexed = []
        for i, e in enumerate(iterator):
            index = count + offset + i
            for j, ele in
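
A self-contained sketch of the same idea, assuming the classic two-pass recipe (count each partition, then assign global indices with mapPartitionsWithIndex); the names are illustrative, not from the original reply:

    # pass 1: element count per partition
    counts = rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
    # prefix sums give each partition's starting offset
    offsets = [0]
    for c in counts[:-1]:
        offsets.append(offsets[-1] + c)

    # pass 2: attach a global index to every element
    def indexing(split_index, iterator):
        for i, e in enumerate(iterator):
            yield (offsets[split_index] + i, e)

    indexed = rdd.mapPartitionsWithIndex(indexing)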

Re: Using Python IDE for Spark Application Development

2014-08-07 Thread Mohit Singh
On Wed, Aug 6, 2014 at 6:22 PM, Mohit Singh mohit1...@gmail.com wrote: My naive set up.. Adding os.environ['SPARK_HOME'] = '/path/to/spark' and sys.path.append('/path/to/spark/python') at the top of my script. from pyspark import SparkContext from pyspark import SparkConf Execution works from within

Re: Using Python IDE for Spark Application Development

2014-08-06 Thread Mohit Singh
My naive set up.. Adding os.environ['SPARK_HOME'] = '/path/to/spark' and sys.path.append('/path/to/spark/python') at the top of my script. from pyspark import SparkContext from pyspark import SparkConf Execution works from within pycharm... Though my next step is to figure out autocompletion and I bet there
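
A runnable sketch of this setup with placeholder paths; note that on Spark 1.x you typically also need the bundled py4j zip on sys.path (the version in the filename varies by release, so adjust it):

    import os
    import sys

    os.environ['SPARK_HOME'] = '/path/to/spark'
    sys.path.append('/path/to/spark/python')
    # assumption: py4j ships under python/lib; match the version to your release
    sys.path.append('/path/to/spark/python/lib/py4j-0.8.2.1-src.zip')

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster('local[*]').setAppName('ide-test')
    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(10)).sum())
    sc.stop()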

Re: Regularization parameters

2014-08-06 Thread Mohit Singh
One possible straightforward explanation might be that your solution(s) are stuck in different local minima?? And depending on your weight initialization, you are getting different parameters? Maybe use the same initial weights for both runs... or I would probably test the execution with synthetic
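
A hedged sketch of pinning down the initialization, assuming the MLlib SGD-based trainers that accept an initialWeights argument; the synthetic dataset is a placeholder in the spirit of the suggestion above:

    import numpy as np
    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint

    # tiny synthetic dataset, as suggested above
    data = sc.parallelize([LabeledPoint(i % 2, np.random.rand(5)) for i in range(100)])

    # the same starting point for both runs removes one source of run-to-run variance
    w0 = np.zeros(5)
    model_a = LogisticRegressionWithSGD.train(data, initialWeights=w0)
    model_b = LogisticRegressionWithSGD.train(data, initialWeights=w0)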

Reading hdf5 formats with pyspark

2014-07-28 Thread Mohit Singh
Hi, We have set up Spark on an HPC system and are trying to put some data pipelines and algorithms in place. The input data is in hdf5 (these are very high resolution brain images) and it can be read via the h5py library in python. So, my current approach (which seems to be working) is writing
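
A hedged sketch of one common pattern for this (dataset name and paths are placeholders): parallelize the file paths and open each HDF5 file on the workers with h5py, so the heavy reads happen in parallel:

    import h5py

    def read_rows(path):
        # executed on the workers; each task opens its own file handle
        with h5py.File(path, 'r') as f:
            for row in f['/images'][:]:   # '/images' is a placeholder dataset name
                yield row

    paths = sc.parallelize(['/data/scan1.h5', '/data/scan2.h5'])
    rows = paths.flatMap(read_rows)

The same caveat as the PyTables thread above applies: h5py must be installed on every worker node.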

Spark streaming

2014-05-01 Thread Mohit Singh
Hi, I guess Spark uses "streaming" in the context of streaming live data, but what I mean is something more along the lines of hadoop streaming.. where one can code in any programming language? Or is something along those lines on the cards? Thanks -- Mohit When you want success as badly as you
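
The closest built-in analogue is probably RDD.pipe, which streams each partition's elements through an external command over stdin/stdout, hadoop-streaming style; a minimal sketch (the command is a placeholder):

    # each element is written to the command's stdin as a line; each line the
    # command prints becomes an element of the result RDD
    piped = sc.parallelize(['a', 'b', 'c']).pipe('tr a-z A-Z')
    print(piped.collect())   # ['A', 'B', 'C']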