create table in hive from spark-sql

2015-09-23 Thread Mohit Singh
Probably a noob question, but I am trying to create a Hive table using Spark SQL.
Here is what I am trying to do:

hc = HiveContext(sc)

hdf = hc.parquetFile(output_path)

data_types = hdf.dtypes

schema = "(" + " ,".join(map(lambda x: x[0] + " " + x[1], data_types)) +")"

hc.sql(" CREATE TABLE IF NOT EXISTS example.foo " + schema)

There is already a database called "example" in Hive, but I see this error:

 An error occurred while calling o35.sql.

: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution
Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:java.lang.IllegalArgumentException:
java.net.URISyntaxException: Relative path in absolute URI:
hdfs://some_path/foo)

Also, I was wondering how to use the saveAsTable(..) construct.
hdf.saveAsTable(tablename) seems to store into the default database; how do I
specify the database name ("example" in this case) when saving this table?
For example, would something like the sketch below be the right direction?
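
(This is just a guess at what I could try; "tmp_foo" is a made-up temp-table
name, and I don't know whether any of these actually respects the target
database.)

hc.sql("USE example")        # switch the current Hive database first
hdf.saveAsTable("foo")       # does this then land in example.foo?

# or register the DataFrame and CREATE TABLE ... AS SELECT into the database:
hdf.registerTempTable("tmp_foo")
hc.sql("CREATE TABLE IF NOT EXISTS example.foo AS SELECT * FROM tmp_foo")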
Thanks
-- 
Mohit

"When you want success as badly as you want the air, then you will get it.
There is no other secret of success."
-Socrates


Re: Spark installation

2015-02-10 Thread Mohit Singh
For a local machine, I don't think there is anything to install. Just unpack
the download and run $SPARK_DIR/bin/spark-shell, and that will open up a REPL.

On Tue, Feb 10, 2015 at 3:25 PM, King sami kgsam...@gmail.com wrote:

 Hi,

 I'm new to Spark. I want to install it on my local machine (Ubuntu 12.04).
 Could you please help me install Spark step by step on my machine and run
 some Scala programs.

 Thanks




-- 
Mohit

When you want success as badly as you want the air, then you will get it.
There is no other secret of success.
-Socrates


Re: ImportError: No module named pyspark, when running pi.py

2015-02-09 Thread Mohit Singh
I think you have to run it using $SPARK_HOME/bin/pyspark /path/to/pi.py
instead of plain python pi.py, so that the pyspark package is on the Python
path.

On Mon, Feb 9, 2015 at 11:22 PM, Ashish Kumar ashish.ku...@innovaccer.com
wrote:

 *Command:*
 sudo python ./examples/src/main/python/pi.py

 *Error:*
 Traceback (most recent call last):
   File "./examples/src/main/python/pi.py", line 22, in <module>
     from pyspark import SparkContext
 ImportError: No module named pyspark




-- 
Mohit

When you want success as badly as you want the air, then you will get it.
There is no other secret of success.
-Socrates


is there a master for spark cluster in ec2

2015-01-28 Thread Mohit Singh
Hi,
  Probably a naive question, but I am creating a Spark cluster on EC2 using
the ec2 scripts bundled with Spark.
Is there a --master param I need to set, i.e. something like
./bin/pyspark --master [ ] ?
I don't yet fully understand the EC2 setup, so I just wanted to confirm this
(a quick check I have in mind is sketched below).
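
(Purely a guess on my part: I assume the ec2 script sets up a standalone
master, so the value would look something like
spark://<master-public-dns>:7077, with the hostname being whatever the script
printed at the end. A quick way to check from inside the shell, I think:)

# Inside the pyspark shell launched on the master node:
print(sc.master)    # e.g. spark://<master-public-dns>:7077, or local[*]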
Thanks

-- 
Mohit

When you want success as badly as you want the air, then you will get it.
There is no other secret of success.
-Socrates


Using third party libraries in pyspark

2015-01-22 Thread Mohit Singh
Hi,
  I might be asking something very trivial, but what's the recommended way of
using third-party libraries?
I am using PyTables (import tables) to read HDF5 files, and here is the error
trace:


print rdd.take(2)
  File /tmp/spark/python/pyspark/rdd.py, line , in take
res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File /tmp/spark/python/pyspark/context.py, line 818, in runJob
it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd,
javaPartitions, allowLocal)
  File /tmp/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py,
line 538, in __call__
  File /tmp/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line
300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
0.0 (TID 3, srv-108-23.720.rdio):
org.apache.spark.api.python.PythonException: Traceback (most recent call
last):
  File
/hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py,
line 90, in main
command = pickleSer._read_with_length(infile)
  File
/hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py,
line 151, in _read_with_length
return self.loads(obj)
  File
/hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py,
line 396, in loads
return cPickle.loads(obj)
  File
/hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/cloudpickle.py,
line 825, in subimport
__import__(name)
ImportError: ('No module named tables', function subimport at 0x47e1398,
('tables',))

Though import tables works fine in the local Python shell, it seems like
everything is being pickled and shipped to the workers. Are we expected to
send all the library files along as helper files? That doesn't seem right.
For example, is something like the sketch below the intended approach, or
does a compiled library like PyTables simply have to be installed on every
worker node?
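
(The path below is made up; I'm only guessing that addPyFile, or the
--py-files flag to spark-submit, is the mechanism meant for this.)

# Ship a pure-Python dependency to the executors (hypothetical zip path).
sc.addPyFile("/path/to/dependency.zip")

# After this, imports inside rdd.map/foreach functions should find the
# shipped code. I doubt this helps for C extensions like tables, though.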
Thanks


-- 


Mohit

When you want success as badly as you want the air, then you will get it.
There is no other secret of success.
-Socrates


Re: How to create Track per vehicle using spark RDD

2014-10-14 Thread Mohit Singh
Perhaps it's just me, but the lag function isn't familiar to me.
Have you tried configuring Spark appropriately?
http://spark.apache.org/docs/latest/configuration.html


On Tue, Oct 14, 2014 at 5:37 PM, Manas Kar manasdebashis...@gmail.com
wrote:

 Hi,
  I have an RDD containing Vehicle Number, timestamp, Position.
  I want the equivalent of the lag function on my RDD, to be able to
 create track segments for each vehicle.

 Any help?

 PS: I have tried reduceByKey and then splitting the list of positions into
 tuples. It runs out of memory every time because of the volume of the
 data.

 ...Manas

 For some reason I have never got any reply to my emails to the user
 group. I am hoping to break that trend this time. :)




-- 
Mohit

When you want success as badly as you want the air, then you will get it.
There is no other secret of success.
-Socrates


Setting up jvm in pyspark from shell

2014-09-10 Thread Mohit Singh
Hi,
  I am using the pyspark shell and am trying to create an RDD from a numpy
matrix:
rdd = sc.parallelize(matrix)
I am getting the following error:
JVMDUMP039I Processing dump event systhrow, detail
java/lang/OutOfMemoryError at 2014/09/10 22:41:44 - please wait.
JVMDUMP032I JVM requested Heap dump using
'/global/u2/m/msingh/heapdump.20140910.224144.29660.0005.phd' in response
to an event
JVMDUMP010I Heap dump written to
/global/u2/m/msingh/heapdump.20140910.224144.29660.0005.phd
JVMDUMP032I JVM requested Java dump using
'/global/u2/m/msingh/javacore.20140910.224144.29660.0006.txt' in response
to an event
JVMDUMP010I Java dump written to
/global/u2/m/msingh/javacore.20140910.224144.29660.0006.txt
JVMDUMP032I JVM requested Snap dump using
'/global/u2/m/msingh/Snap.20140910.224144.29660.0007.trc' in response to an
event
JVMDUMP010I Snap dump written to
/global/u2/m/msingh/Snap.20140910.224144.29660.0007.trc
JVMDUMP013I Processed dump event systhrow, detail
java/lang/OutOfMemoryError.
Exception AttributeError: 'SparkContext' object has no attribute '_jsc'
in bound method SparkContext.__del__ of pyspark.context.SparkContext
object at 0x11f9450 ignored
Traceback (most recent call last):
  File stdin, line 1, in module
  File /usr/common/usg/spark/1.0.2/python/pyspark/context.py, line 271,
in parallelize
jrdd = readRDDFromFile(self._jsc, tempFile.name, numSlices)
  File
/usr/common/usg/spark/1.0.2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py,
line 537, in __call__
  File
/usr/common/usg/spark/1.0.2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py,
line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.readRDDFromFile.
: java.lang.OutOfMemoryError: Java heap space
at
org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:279)
at org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:88)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
at java.lang.reflect.Method.invoke(Method.java:618)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:804)

I did try setSystemProperty:
sc.setSystemProperty("spark.executor.memory", "20g")
How do I increase the JVM heap from the shell? For example, is something like
the sketch below the intended way, or does driver memory have to be set
before the shell launches?
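
(Just a guess; I don't know whether a SparkContext re-created inside an
already-running shell actually picks up a larger driver heap.)

from pyspark import SparkConf, SparkContext

sc.stop()  # stop the context the shell created

conf = (SparkConf()
        .set("spark.driver.memory", "20g")      # may only matter at JVM launch
        .set("spark.executor.memory", "20g"))
sc = SparkContext(conf=conf)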

-- 
Mohit

When you want success as badly as you want the air, then you will get it.
There is no other secret of success.
-Socrates


Personalized Page rank in graphx

2014-08-20 Thread Mohit Singh
Hi,
   I was wondering if the Personalized PageRank algorithm is implemented in
GraphX. If the talks and presentations are to be believed (
https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx@strata2014_final.pdf)
it is, but I can't find the algorithm code (
https://github.com/amplab/graphx/tree/master/graphx/src/main/scala/org/apache/spark/graphx/lib
)?

Thanks

-- 
Mohit

When you want success as badly as you want the air, then you will get it.
There is no other secret of success.
-Socrates


Re: Question on mappartitionwithsplit

2014-08-17 Thread Mohit Singh
Building on what Davies Liu said, how about something like:

def indexing(splitIndex, iterator, offset_lists):
  count = 0
  offset = sum(offset_lists[:splitIndex]) if splitIndex else 0
  indexed = []
  for i, e in enumerate(iterator):
    index = count + offset + i
    for j, ele in enumerate(e):
      indexed.append((index, j, ele))
  yield indexed

def another_funct(offset_lists):
  # get that damn offset_lists
  # rdd.mapPartitionsWithSplit(indexing)
  rdd.mapPartitionsWithSplit(lambda index, it: indexing(index, it,
                                                        offset_lists))

P.S. In the indexing function, the count variable probably isn't doing
anything useful?
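
Putting it together, a self-contained sketch of the closure approach (using
mapPartitionsWithIndex, made-up data, and hard-coded offsets just for
illustration; here I yield the tuples directly instead of one list per
partition):

from pyspark import SparkContext

sc = SparkContext("local[2]", "indexing-example")

rdd = sc.parallelize([[10, 20], [30, 40], [50, 60], [70, 80]], 2)

# Hypothetical per-partition offsets; in practice you would compute these
# from the partition sizes in a first pass.
offset_lists = [2, 2]

def indexing(splitIndex, iterator, offset_lists):
    offset = sum(offset_lists[:splitIndex]) if splitIndex else 0
    for i, e in enumerate(iterator):
        for j, ele in enumerate(e):
            yield (offset + i, j, ele)

indexed = rdd.mapPartitionsWithIndex(
    lambda idx, it: indexing(idx, it, offset_lists))
print(indexed.collect())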



On Sun, Aug 17, 2014 at 11:21 AM, Chengi Liu chengi.liu...@gmail.com
wrote:

 Hi,
   Thanks for the response..
 In the second case, f2?
 foo will have to be declared globally, right?

 My function is something like:
 def indexing(splitIndex, iterator):
   count = 0
   offset = sum(offset_lists[:splitIndex]) if splitIndex else 0
   indexed = []
   for i, e in enumerate(iterator):
 index = count + offset + i
 for j, ele in enumerate(e):
   indexed.append((index, j, ele))
   yield indexed

 def another_funct(offset_lists):
 # get that damn offset_lists
 rdd.mapPartitionsWithSplit(indexing)
 But then, the issue is how to get that offset_lists in there?
 Any suggestions?


 On Sun, Aug 17, 2014 at 11:15 AM, Davies Liu dav...@databricks.com
 wrote:

 The callback function f only accepts 2 arguments; if you want to pass
 other objects to it, you need a closure, such as:

 foo=xxx
 def f(index, iterator, foo):
  yield (index, foo)
 rdd.mapPartitionsWithIndex(lambda index, it: f(index, it, foo))

 also you can make f a closure:

 def f2(index, iterator):
 yield (index, foo)
 rdd.mapPartitionsWithIndex(f2)

 On Sun, Aug 17, 2014 at 10:25 AM, Chengi Liu chengi.liu...@gmail.com
 wrote:
  Hi,
In this example:
 
 http://www.cs.berkeley.edu/~pwendell/strataconf/api/pyspark/pyspark.rdd.RDD-class.html#mapPartitionsWithSplit
  Let's say f takes three arguments:
  def f(splitIndex, iterator, foo): yield splitIndex
  Now, how do i send this foo parameter to this method?
  rdd.mapPartitionsWithSplit(f)
  Thanks
 





-- 
Mohit

When you want success as badly as you want the air, then you will get it.
There is no other secret of success.
-Socrates


Re: Using Python IDE for Spark Application Development

2014-08-07 Thread Mohit Singh
Take a look at this gist
https://gist.github.com/bigaidream/40fe0f8267a80e7c9cf8
That worked for me.


On Wed, Aug 6, 2014 at 7:32 PM, Sathish Kumaran Vairavelu 
vsathishkuma...@gmail.com wrote:

 Mohit, this doesn't seem to be working; can you please provide more
 details? When I use "from pyspark import SparkContext", it is shown as
 disabled in PyCharm. I use PyCharm Community Edition. Where should I set the
 environment variables: in the same Python script or a different one?

 Also, should I run a local Spark cluster so the Spark program runs on top
 of it?


 Appreciate your help

 -Sathish


 On Wed, Aug 6, 2014 at 6:22 PM, Mohit Singh mohit1...@gmail.com wrote:

 My naive set up..
 Adding
 os.environ['SPARK_HOME'] = "/path/to/spark"
 sys.path.append("/path/to/spark/python")
 on top of my script.
 from pyspark import SparkContext
 from pyspark import SparkConf
 Execution works from within pycharm...

 Though my next step is to figure out autocompletion and I bet there are
 better ways to develop apps for spark..



 On Wed, Aug 6, 2014 at 4:16 PM, Sathish Kumaran Vairavelu 
 vsathishkuma...@gmail.com wrote:

 Hello,

 I am trying to use the Python IDE PyCharm for Spark application
 development. How can I use pyspark with a Python IDE? Can anyone help me with
 this?


 Thanks

 Sathish





 --
 Mohit

 When you want success as badly as you want the air, then you will get
 it. There is no other secret of success.
 -Socrates





-- 
Mohit

When you want success as badly as you want the air, then you will get it.
There is no other secret of success.
-Socrates


Re: Using Python IDE for Spark Application Development

2014-08-06 Thread Mohit Singh
My naive set up..
Adding
os.environ['SPARK_HOME'] = "/path/to/spark"
sys.path.append("/path/to/spark/python")
on top of my script.
from pyspark import SparkContext
from pyspark import SparkConf
Execution works from within pycharm...

Though my next step is to figure out autocompletion, and I bet there are
better ways to develop apps for Spark. A slightly fuller sketch of what I run
is below.
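
(Paths are placeholders; depending on the Spark version the bundled py4j zip
may also need to be on the path.)

import os
import sys

os.environ["SPARK_HOME"] = "/path/to/spark"
sys.path.append("/path/to/spark/python")
# e.g. sys.path.append("/path/to/spark/python/lib/py4j-0.8.2.1-src.zip")

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[2]").setAppName("pycharm-test")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())
sc.stop()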



On Wed, Aug 6, 2014 at 4:16 PM, Sathish Kumaran Vairavelu 
vsathishkuma...@gmail.com wrote:

 Hello,

 I am trying to use the Python IDE PyCharm for Spark application
 development. How can I use pyspark with a Python IDE? Can anyone help me with
 this?


 Thanks

 Sathish





-- 
Mohit

When you want success as badly as you want the air, then you will get it.
There is no other secret of success.
-Socrates


Re: Regularization parameters

2014-08-06 Thread Mohit Singh
One possible straightforward explanation might be that your solutions are
stuck in local minima, and depending on the weight initialization you are
getting different parameters?
Maybe use the same initial weights for both runs,
or
I would probably test the run on a synthetic dataset with a known global
solution? (A rough sketch of what I mean is below.)
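
(Assuming a pyspark.mllib version whose LogisticRegressionWithSGD.train
exposes regParam/regType; the data generation is made up.)

import random
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

# Two well-separated classes, so the "right" answer is unambiguous.
def make_point(_):
    label = random.randint(0, 1)
    features = [random.gauss(3.0 * label, 1.0), random.gauss(-3.0 * label, 1.0)]
    return LabeledPoint(label, features)

data = sc.parallelize(range(10000)).map(make_point).cache()

# Same data and defaults, only the penalty type differs; with regParam=0.0
# the two models should come out (nearly) identical.
m_l2 = LogisticRegressionWithSGD.train(data, iterations=100,
                                       regParam=0.0, regType="l2")
m_l1 = LogisticRegressionWithSGD.train(data, iterations=100,
                                       regParam=0.0, regType="l1")
print(m_l2.weights, m_l1.weights)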



On Wed, Aug 6, 2014 at 7:12 PM, Burak Yavuz bya...@stanford.edu wrote:

 Hi,

 That is interesting. Would you please share some code on how you are
 setting the regularization type, regularization parameters and running
 Logistic Regression?

 Thanks,
 Burak

 - Original Message -
 From: SK skrishna...@gmail.com
 To: u...@spark.incubator.apache.org
 Sent: Wednesday, August 6, 2014 6:18:43 PM
 Subject: Regularization parameters

 Hi,

 I tried different regularization parameter values with Logistic Regression
 for binary classification of my dataset and would like to understand the
 following results:

 regType = L2, regParam = 0.0 , I am getting AUC = 0.80 and accuracy of 80%
 regType = L1, regParam = 0.0 , I am getting AUC = 0.80 and accuracy of 50%

 To calculate accuracy I am using 0.5 as the threshold: prediction < 0.5 is
 class 0, and prediction >= 0.5 is class 1.

 regParam = 0.0 implies I am not using any regularization, is that correct?
 If so, it should not matter whether I specify L1 or L2; I should get the
 same results. So why is the accuracy different?

 thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Regularization-parameters-tp11601.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







-- 
Mohit

When you want success as badly as you want the air, then you will get it.
There is no other secret of success.
-Socrates


Reading hdf5 formats with pyspark

2014-07-28 Thread Mohit Singh
Hi,
   We have set up Spark on an HPC system and are trying to put some data
pipelines and algorithms in place.
The input data is in HDF5 (these are very high resolution brain images), and
it can be read via the h5py library in Python. So my current approach (which
seems to be working) is to write a function
def process(filename):
   #logic

and then execute via
files = [list of filenames]
sc.parallelize(files).foreach(process)

Is this the right approach? (A sketch of what I mean by process is below.)
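
(Roughly what the per-file function looks like; the dataset name "/images"
and the actual computation are made up for illustration.)

import h5py

def process(filename):
    # Each task opens its own file on the worker; nothing is returned, and
    # results are written out as a side effect of the function.
    with h5py.File(filename, "r") as f:
        data = f["/images"][:]   # "/images" is a made-up dataset name
        # ... run the per-file computation on `data` and write results out ...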
-- 
Mohit

When you want success as badly as you want the air, then you will get it.
There is no other secret of success.
-Socrates


Spark streaming

2014-05-01 Thread Mohit Singh
Hi,
  I guess Spark uses "streaming" in the context of streaming live data, but
what I mean is something more along the lines of Hadoop Streaming, where one
can code in any programming language.
Or is something along those lines on the cards?
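
(For context, the closest thing I am aware of is piping an RDD through an
external program; a rough sketch below, with a made-up command. I am not sure
whether that is considered the equivalent.)

# Run each partition's elements through an external program's stdin and
# collect the program's stdout lines back as an RDD of strings.
piped = sc.parallelize(["a", "b", "c"]).pipe("/path/to/some_binary")
print(piped.collect())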

Thanks
-- 
Mohit

When you want success as badly as you want the air, then you will get it.
There is no other secret of success.
-Socrates