create table in hive from spark-sql
Probably a noob question, but I am trying to create a Hive table using Spark SQL. Here is what I am trying to do:

hc = HiveContext(sc)
hdf = hc.parquetFile(output_path)
data_types = hdf.dtypes
schema = "(" + " ,".join(map(lambda x: x[0] + " " + x[1], data_types)) + ")"
hc.sql("CREATE TABLE IF NOT EXISTS example.foo " + schema)

There is already a database called "example" in Hive, but I see this error:

An error occurred while calling o35.sql. : org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hdfs://some_path/foo)

I was also wondering how to use the saveAsTable(..) construct: hdf.saveAsTable(tablename) tries to store into the default db. How do I specify the database name ("example" in this case) while storing this table?

Thanks
-- Mohit
"When you want success as badly as you want the air, then you will get it. There is no other secret of success." -Socrates
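For reference, a minimal sketch of the two approaches being asked about (paths and table names are placeholders; the USE statement is one assumed way to steer saveAsTable away from the default database, not a confirmed fix for the URI error above):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hc = HiveContext(sc)
hdf = hc.parquetFile("/path/to/parquet_output")        # placeholder path

# Option 1: build the DDL from the dtypes, as in the question
schema = "(" + ", ".join(name + " " + dtype for name, dtype in hdf.dtypes) + ")"
hc.sql("CREATE TABLE IF NOT EXISTS example.foo " + schema)

# Option 2: switch the current database first, then let saveAsTable
# create the table there instead of in "default"
hc.sql("USE example")
hdf.saveAsTable("foo")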
Re: Spark installation
For a local machine, I don't think there is anything to install. Just unzip the download and run $SPARK_DIR/bin/spark-shell, and that will open up a REPL...

On Tue, Feb 10, 2015 at 3:25 PM, King sami kgsam...@gmail.com wrote: Hi, I'm new to Spark. I want to install it on my local machine (Ubuntu 12.04). Could you please help me install Spark step by step on my machine and run some Scala programs? Thanks

-- Mohit
"When you want success as badly as you want the air, then you will get it. There is no other secret of success." -Socrates
Re: ImportError: No module named pyspark, when running pi.py
I think you have to run it using $SPARK_HOME/bin/pyspark /path/to/pi.py instead of plain "python pi.py".

On Mon, Feb 9, 2015 at 11:22 PM, Ashish Kumar ashish.ku...@innovaccer.com wrote:

Command: sudo python ./examples/src/main/python/pi.py
Error:
Traceback (most recent call last):
  File "./examples/src/main/python/pi.py", line 22, in <module>
    from pyspark import SparkContext
ImportError: No module named pyspark

-- Mohit
"When you want success as badly as you want the air, then you will get it. There is no other secret of success." -Socrates
is there a master for spark cluster in ec2
Hi, Probably a naive question, but I am creating a Spark cluster on EC2 using the bundled ec2 scripts. Is there a master param I need to set, i.e. ./bin/pyspark --master [ ]? I don't yet fully understand the EC2 setup, so I just wanted to confirm this. Thanks

-- Mohit
"When you want success as badly as you want the air, then you will get it. There is no other secret of success." -Socrates
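For reference, a hedged sketch of pointing a session at the standalone master that the spark-ec2 scripts start (the hostname below is a placeholder; the launch script prints the real master address when the cluster comes up):

from pyspark import SparkConf, SparkContext

master_url = "spark://ec2-54-0-0-1.compute-1.amazonaws.com:7077"   # placeholder host
conf = SparkConf().setMaster(master_url).setAppName("ec2-test")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).count())
# roughly equivalent to launching the shell with:
#   ./bin/pyspark --master spark://<master-hostname>:7077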
Using third party libraries in pyspark
Hi, I might be asking something very trivial, but what's the recommended way of using third-party libraries? I am using PyTables ("tables") to read an hdf5 format file. Here is the error trace:

print rdd.take(2)
  File "/tmp/spark/python/pyspark/rdd.py", line , in take
    res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File "/tmp/spark/python/pyspark/context.py", line 818, in runJob
    it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
  File "/tmp/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/tmp/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, srv-108-23.720.rdio): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py", line 90, in main
    command = pickleSer._read_with_length(infile)
  File "/hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py", line 151, in _read_with_length
    return self.loads(obj)
  File "/hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py", line 396, in loads
    return cPickle.loads(obj)
  File "/hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/cloudpickle.py", line 825, in subimport
    __import__(name)
ImportError: ('No module named tables', <function subimport at 0x47e1398>, ('tables',))

"import tables" works fine in the local Python shell, but it seems everything is being pickled. Are we expected to ship all the dependency files as helper files? That doesn't seem right. Thanks

-- Mohit
"When you want success as badly as you want the air, then you will get it. There is no other secret of success." -Socrates
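For reference, a minimal sketch of the usual options (paths and helper names are purely illustrative): pure-Python dependencies can be shipped to the executors with addPyFile / --py-files, while C-extension packages such as PyTables generally have to be installed on every worker node, because the import happens inside each executor's Python process.

from pyspark import SparkContext

sc = SparkContext()
# Ship a pure-Python zip/egg/.py dependency to the executors
# (equivalent to passing --py-files on the command line)
sc.addPyFile("/path/to/my_helpers.zip")       # placeholder path

def read_one(path):
    # fails with ImportError unless "tables" is installed on the worker node
    import tables
    return (path, tables.__version__)

paths = ["/data/file_%d.h5" % i for i in range(4)]   # placeholder filenames
print(sc.parallelize(paths).map(read_one).collect())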
Re: How to create Track per vehicle using spark RDD
Perhaps it's just me, but the lag function isn't familiar to me... Have you tried configuring Spark appropriately? http://spark.apache.org/docs/latest/configuration.html

On Tue, Oct 14, 2014 at 5:37 PM, Manas Kar manasdebashis...@gmail.com wrote: Hi, I have an RDD containing (vehicle number, timestamp, position). I want the equivalent of a lag function over my RDD so that I can build the track segments of each vehicle. Any help? PS: I have tried reduceByKey and then splitting the list of positions into tuples. It runs out of memory every time because of the volume of data. ...Manas (For some reason I have never got any reply to my emails to the user group. I am hoping to break that trend this time. :))

-- Mohit
"When you want success as badly as you want the air, then you will get it. There is no other secret of success." -Socrates
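For what it's worth, a hedged sketch (all names and data are illustrative) of emulating a lag over an RDD: group per vehicle, sort each group by timestamp, and pair consecutive points. Note that groupByKey still materializes one vehicle's whole track on a single executor, so this only helps if a single track fits in memory.

from pyspark import SparkContext

sc = SparkContext()
records = sc.parallelize([
    ("V1", 1, (0.0, 0.0)), ("V1", 2, (0.1, 0.0)), ("V2", 1, (5.0, 5.0)),
])  # (vehicle, timestamp, position) -- toy data

def to_segments(points):
    pts = sorted(points)                 # sort by timestamp
    return list(zip(pts, pts[1:]))       # (previous point, current point) pairs

segments = (records
            .map(lambda r: (r[0], (r[1], r[2])))   # key by vehicle
            .groupByKey()
            .mapValues(to_segments))
print(segments.collect())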
Setting up jvm in pyspark from shell
Hi, I am using the pyspark shell and am trying to create an RDD from a numpy matrix:

rdd = sc.parallelize(matrix)

I am getting the following error:

JVMDUMP039I Processing dump event systhrow, detail java/lang/OutOfMemoryError at 2014/09/10 22:41:44 - please wait.
JVMDUMP032I JVM requested Heap dump using '/global/u2/m/msingh/heapdump.20140910.224144.29660.0005.phd' in response to an event
JVMDUMP010I Heap dump written to /global/u2/m/msingh/heapdump.20140910.224144.29660.0005.phd
JVMDUMP032I JVM requested Java dump using '/global/u2/m/msingh/javacore.20140910.224144.29660.0006.txt' in response to an event
JVMDUMP010I Java dump written to /global/u2/m/msingh/javacore.20140910.224144.29660.0006.txt
JVMDUMP032I JVM requested Snap dump using '/global/u2/m/msingh/Snap.20140910.224144.29660.0007.trc' in response to an event
JVMDUMP010I Snap dump written to /global/u2/m/msingh/Snap.20140910.224144.29660.0007.trc
JVMDUMP013I Processed dump event systhrow, detail java/lang/OutOfMemoryError.
Exception AttributeError: "'SparkContext' object has no attribute '_jsc'" in <bound method SparkContext.__del__ of <pyspark.context.SparkContext object at 0x11f9450>> ignored
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/common/usg/spark/1.0.2/python/pyspark/context.py", line 271, in parallelize
    jrdd = readRDDFromFile(self._jsc, tempFile.name, numSlices)
  File "/usr/common/usg/spark/1.0.2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
  File "/usr/common/usg/spark/1.0.2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.readRDDFromFile.
: java.lang.OutOfMemoryError: Java heap space
    at org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:279)
    at org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:88)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
    at java.lang.reflect.Method.invoke(Method.java:618)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:804)

I did try sc.setSystemProperty("spark.executor.memory", "20g"). How do I increase the JVM heap from the shell?

-- Mohit
"When you want success as badly as you want the air, then you will get it. There is no other secret of success." -Socrates
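As a hedged aside (exact flags depend on the Spark release): the OutOfMemoryError above is raised in the driver JVM while readRDDFromFile copies the data, so it is the driver heap rather than spark.executor.memory that usually needs raising, and it has to be set before that JVM starts.

# Hedged sketch only. For the interactive shell the driver JVM is started by
# the launcher, so its heap is fixed at launch time, e.g. (assuming your
# release supports the flag):
#   ./bin/pyspark --driver-memory 8g
# For a standalone script, the properties can go on the SparkConf before the
# SparkContext (and its JVM) is created:
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.driver.memory", "8g")      # driver-side heap
        .set("spark.executor.memory", "8g"))   # executor-side heap
sc = SparkContext(conf=conf)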
Personalized Page rank in graphx
Hi, I was wondering if the Personalized PageRank algorithm is implemented in GraphX. Judging by the talks and presentations (https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx@strata2014_final.pdf) it is, but I can't find the algorithm code (https://github.com/amplab/graphx/tree/master/graphx/src/main/scala/org/apache/spark/graphx/lib)? Thanks

-- Mohit
"When you want success as badly as you want the air, then you will get it. There is no other secret of success." -Socrates
Re: Question on mappartitionwithsplit
Building on what Davies Liu said, how about something like:

def indexing(splitIndex, iterator, offset_lists):
    count = 0
    offset = sum(offset_lists[:splitIndex]) if splitIndex else 0
    indexed = []
    for i, e in enumerate(iterator):
        index = count + offset + i
        for j, ele in enumerate(e):
            indexed.append((index, j, ele))
    yield indexed

def another_funct(offset_lists):
    # get that damn offset_lists
    # rdd.mapPartitionsWithSplit(indexing)
    rdd.mapPartitionsWithSplit(lambda index, it: indexing(index, it, offset_lists))

P.S. In the indexing function, the count variable probably isn't doing anything useful?

On Sun, Aug 17, 2014 at 11:21 AM, Chengi Liu chengi.liu...@gmail.com wrote: Hi, Thanks for the response. In the second case, f2, foo would have to be declared globally, right? My function is something like:

def indexing(splitIndex, iterator):
    count = 0
    offset = sum(offset_lists[:splitIndex]) if splitIndex else 0
    indexed = []
    for i, e in enumerate(iterator):
        index = count + offset + i
        for j, ele in enumerate(e):
            indexed.append((index, j, ele))
    yield indexed

def another_funct(offset_lists):
    # get that damn offset_lists
    rdd.mapPartitionsWithSplit(indexing)

But then, the issue is that offset_lists? Any suggestions?

On Sun, Aug 17, 2014 at 11:15 AM, Davies Liu dav...@databricks.com wrote: The callback function f only accepts 2 arguments; if you want to pass other objects to it, you need a closure, such as:

foo = xxx
def f(index, iterator, foo):
    yield (index, foo)
rdd.mapPartitionsWithIndex(lambda index, it: f(index, it, foo))

You can also make f itself a closure:

def f2(index, iterator):
    yield (index, foo)
rdd.mapPartitionsWithIndex(f2)

On Sun, Aug 17, 2014 at 10:25 AM, Chengi Liu chengi.liu...@gmail.com wrote: Hi, In this example: http://www.cs.berkeley.edu/~pwendell/strataconf/api/pyspark/pyspark.rdd.RDD-class.html#mapPartitionsWithSplit Let's say f takes three arguments:

def f(splitIndex, iterator, foo):
    yield splitIndex

Now, how do I pass this foo parameter to the method rdd.mapPartitionsWithSplit(f)? Thanks

-- Mohit
"When you want success as badly as you want the air, then you will get it. There is no other secret of success." -Socrates
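For reference, a self-contained sketch of the closure approach discussed above (the toy data and per-partition offsets are made up for illustration):

from pyspark import SparkContext

sc = SparkContext()
rdd = sc.parallelize([[10, 20], [30], [40, 50]], 3)   # 3 partitions, one row each
offset_lists = [1, 1, 1]     # rows per partition, assumed precomputed elsewhere

def indexing(split_index, iterator, offsets):
    offset = sum(offsets[:split_index])
    for i, row in enumerate(iterator):
        for j, ele in enumerate(row):
            yield (offset + i, j, ele)    # (global row index, column index, value)

indexed = rdd.mapPartitionsWithIndex(
    lambda idx, it: indexing(idx, it, offset_lists))
print(indexed.collect())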
Re: Using Python IDE for Spark Application Development
Take a look at this gist: https://gist.github.com/bigaidream/40fe0f8267a80e7c9cf8 That worked for me.

On Wed, Aug 6, 2014 at 7:32 PM, Sathish Kumaran Vairavelu vsathishkuma...@gmail.com wrote: Mohit, this doesn't seem to be working; can you please provide more details? When I use "from pyspark import SparkContext" it is disabled in PyCharm. I use PyCharm Community Edition. Where should I set the environment variables, in the same Python script or a different one? Also, should I run a local Spark cluster so the Spark program runs on top of that? Appreciate your help. -Sathish

On Wed, Aug 6, 2014 at 6:22 PM, Mohit Singh mohit1...@gmail.com wrote: My naive set up: adding

os.environ['SPARK_HOME'] = "/path/to/spark"
sys.path.append("/path/to/spark/python")

on top of my script, then

from pyspark import SparkContext
from pyspark import SparkConf

Execution works from within PyCharm... Though my next step is to figure out autocompletion, and I bet there are better ways to develop apps for Spark.

On Wed, Aug 6, 2014 at 4:16 PM, Sathish Kumaran Vairavelu vsathishkuma...@gmail.com wrote: Hello, I am trying to use the Python IDE PyCharm for Spark application development. How can I use pyspark with a Python IDE? Can anyone help me with this? Thanks, Sathish

-- Mohit
"When you want success as badly as you want the air, then you will get it. There is no other secret of success." -Socrates
Re: Using Python IDE for Spark Application Development
My naive set up: adding

os.environ['SPARK_HOME'] = "/path/to/spark"
sys.path.append("/path/to/spark/python")

on top of my script, then

from pyspark import SparkContext
from pyspark import SparkConf

Execution works from within PyCharm... Though my next step is to figure out autocompletion, and I bet there are better ways to develop apps for Spark.

On Wed, Aug 6, 2014 at 4:16 PM, Sathish Kumaran Vairavelu vsathishkuma...@gmail.com wrote: Hello, I am trying to use the Python IDE PyCharm for Spark application development. How can I use pyspark with a Python IDE? Can anyone help me with this? Thanks, Sathish

-- Mohit
"When you want success as badly as you want the air, then you will get it. There is no other secret of success." -Socrates
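For reference, a minimal end-to-end sketch of this setup (paths are placeholders; depending on the Spark version the bundled py4j zip may also need to be on sys.path):

import os
import sys

os.environ['SPARK_HOME'] = "/path/to/spark"               # placeholder path
sys.path.append("/path/to/spark/python")
# possibly also, depending on the release:
# sys.path.append("/path/to/spark/python/lib/py4j-0.8.2.1-src.zip")

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[2]").setAppName("pycharm-test")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())    # quick sanity check from the IDE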
Re: Regularization parameters
One possible straightforward explanation might be that your solutions are stuck in different local minima, and depending on the weight initialization you are getting different parameters. Maybe use the same initial weights for both runs, or test the run on a synthetic dataset with a known global solution?

On Wed, Aug 6, 2014 at 7:12 PM, Burak Yavuz bya...@stanford.edu wrote: Hi, That is interesting. Would you please share some code on how you are setting the regularization type, regularization parameters and running Logistic Regression? Thanks, Burak

----- Original Message ----- From: SK skrishna...@gmail.com To: u...@spark.incubator.apache.org Sent: Wednesday, August 6, 2014 6:18:43 PM Subject: Regularization parameters

Hi, I tried different regularization parameter values with Logistic Regression for binary classification of my dataset and would like to understand the following results:

regType = L2, regParam = 0.0: I am getting AUC = 0.80 and accuracy of 80%
regType = L1, regParam = 0.0: I am getting AUC = 0.80 and accuracy of 50%

To calculate accuracy I am using 0.5 as the threshold: a prediction below 0.5 is class 0, and a prediction of 0.5 or above is class 1. regParam = 0.0 implies I am not using any regularization, is that correct? If so, it should not matter whether I specify L1 or L2; I should get the same results. So why is the accuracy value different? Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Regularization-parameters-tp11601.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-- Mohit
"When you want success as badly as you want the air, then you will get it. There is no other secret of success." -Socrates
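For reference, a hedged sketch of the two runs being compared (synthetic toy data; the regType keyword assumes an MLlib release that exposes it), where regParam = 0.0 should indeed make the choice of penalty irrelevant:

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext()
# tiny synthetic, linearly separable dataset
data = sc.parallelize([LabeledPoint(i % 2, [float(i % 2), 1.0]) for i in range(100)])

m_l2 = LogisticRegressionWithSGD.train(data, iterations=100, regParam=0.0, regType="l2")
m_l1 = LogisticRegressionWithSGD.train(data, iterations=100, regParam=0.0, regType="l1")
print(m_l2.weights, m_l1.weights)   # with no regularization these should agree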
Reading hdf5 formats with pyspark
Hi, we have set up Spark on an HPC system and are trying to put some data pipelines and algorithms in place. The input data is in hdf5 (these are very high resolution brain images), and it can be read via the h5py library in Python. So my current approach (which seems to be working) is to write a function

def process(filename):
    # logic

and then execute it via

files = [list of filenames]
sc.parallelize(files).foreach(process)

Is this the right approach?

-- Mohit
"When you want success as badly as you want the air, then you will get it. There is no other secret of success." -Socrates
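For reference, roughly what that approach looks like end to end (paths and dataset names are placeholders; h5py has to be importable on every worker node):

from pyspark import SparkContext

sc = SparkContext()

def process(filename):
    import h5py                            # imported on the worker
    with h5py.File(filename, "r") as f:
        data = f["/some/dataset"][:]       # placeholder dataset path
        # ... per-file processing logic goes here ...

files = ["/data/brain_%03d.h5" % i for i in range(10)]   # placeholder filenames
sc.parallelize(files, len(files)).foreach(process)

Using one partition per file keeps each process() call in its own task; anything that needs to come back to the driver would use map()/collect() instead of foreach().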
Spark streaming
Hi, I guess Spark uses "streaming" in the context of streaming live data, but what I mean is something more along the lines of Hadoop Streaming, where one can code in any programming language. Or is something along those lines on the cards? Thanks

-- Mohit
"When you want success as badly as you want the air, then you will get it. There is no other secret of success." -Socrates
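For what it's worth, RDD.pipe() already offers a Hadoop-Streaming-style escape hatch: each partition is written to an external command's stdin, and the command's stdout lines come back as an RDD. A hedged sketch (the command is just an example):

from pyspark import SparkContext

sc = SparkContext()
lines = sc.parallelize(["a b", "c", "d e f"])
# any executable that reads stdin and writes stdout will do
upper = lines.pipe("tr 'a-z' 'A-Z'")
print(upper.collect())        # ['A B', 'C', 'D E F']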