Probably a noob question, but I am trying to create a Hive table using Spark SQL.
Here is what I am trying to do:
from pyspark.sql import HiveContext

hc = HiveContext(sc)
hdf = hc.parquetFile(output_path)
data_types = hdf.dtypes
schema = "(" + ", ".join(map(lambda x: x[0] + " " + x[1], data_types)) + ")"
hc.sql(" CREATE TABLE IF
For a local machine, I don't think there is anything to install.. Just unzip,
run $SPARK_DIR/bin/spark-shell, and that will open up a REPL...
On Tue, Feb 10, 2015 at 3:25 PM, King sami kgsam...@gmail.com wrote:
Hi,
I'm new to Spark. I want to install it on my local machine (Ubuntu 12.04)
Could
I think you have to run that using $SPARK_HOME/bin/pyspark /path/to/pi.py
instead of normal python pi.py
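Concretely, assuming a stock layout, $SPARK_HOME/bin/spark-submit examples/src/main/python/pi.py also works on releases that ship spark-submit; plain python fails because the pyspark module is not on sys.path.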
On Mon, Feb 9, 2015 at 11:22 PM, Ashish Kumar ashish.ku...@innovaccer.com
wrote:
*Command:*
sudo python ./examples/src/main/python/pi.py
*Error:*
Traceback (most recent call last):
Hi,
Probably a naive question.. but I am creating a Spark cluster on EC2
using the ec2 scripts bundled with Spark..
Is there a master param I need to set, i.e.
./bin/pyspark --master [ ] ??
I don't yet fully understand the EC2 concepts, so I just wanted to confirm
this??
Thanks
--
Mohit
When you want
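For reference, the ec2 script launches a standalone cluster whose master listens on port 7077, and the URL it reports can be passed straight to --master or to SparkContext. A minimal sketch with a placeholder hostname:

from pyspark import SparkContext

# Substitute the hostname the launch script reports for the placeholder.
sc = SparkContext(master="spark://<master-hostname>:7077", appName="ec2-test")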
Hi,
I might be asking something very trivial, but what's the recommended way of
using third-party libraries?
I am using tables (PyTables) to read an hdf5-format file..
And here is the error trace:
print rdd.take(2)
File "/tmp/spark/python/pyspark/rdd.py", line , in take
res =
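For pure-Python dependencies, one common route is shipping the package with the job; a minimal sketch (the path is a placeholder):

from pyspark import SparkContext

sc = SparkContext()
# Distribute the package to every worker before closures import it;
# .py files, .zip archives, and .egg files all work here.
sc.addPyFile("/path/to/dependency.zip")  # placeholder path

That said, tables and h5py wrap native HDF5 code, so in practice they have to be installed on every worker node rather than shipped this way.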
Perhaps it's just me, but the lag function isn't familiar to me..
But have you tried configuring Spark appropriately?
http://spark.apache.org/docs/latest/configuration.html
On Tue, Oct 14, 2014 at 5:37 PM, Manas Kar manasdebashis...@gmail.com
wrote:
Hi,
I have an RDD containing Vehicle Number
Hi,
I am using the pyspark shell and am trying to create an RDD from a numpy matrix:
rdd = sc.parallelize(matrix)
I am getting the following error:
JVMDUMP039I Processing dump event "systhrow", detail
"java/lang/OutOfMemoryError" at 2014/09/10 22:41:44 - please wait.
JVMDUMP032I JVM requested Heap dump
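One workaround sketch, assuming the dump comes from pickling the matrix one row at a time on the driver (the sizes and chunk count are illustrative):

import numpy as np
from pyspark import SparkContext

sc = SparkContext()
matrix = np.random.rand(100000, 128)  # stand-in for the real matrix

# Ship a few hundred row blocks instead of one record per row; far
# fewer, larger pickled objects pass through the driver.
n_chunks = 256
rdd = sc.parallelize(np.array_split(matrix, n_chunks), n_chunks)
rows = rdd.flatMap(lambda block: list(block))  # back to one row per record

Raising spark.driver.memory (or the JVM heap, given the IBM JVMDUMP messages) is the other obvious knob.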
Hi,
I was wondering if the Personalized PageRank algorithm is implemented in
GraphX. If the talks and presentations are to be believed (
https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx@strata2014_final.pdf)
it is.. but I can't find the algo code (
Building on what Davies Liu said,
How about something like:
def indexing(splitIndex, iterator, offset_lists):
    count = 0
    offset = sum(offset_lists[:splitIndex]) if splitIndex else 0
    indexed = []
    for i, e in enumerate(iterator):
        index = count + offset + i
        for j, ele in
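A self-contained sketch of the same idea, for reference (global indexing via mapPartitionsWithIndex; the two-pass structure and the names here are mine, not from the thread):

from pyspark import SparkContext

sc = SparkContext()
rdd = sc.parallelize(range(100), 4)

# First pass: per-partition counts, so each partition knows its offset.
sizes = rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()

def indexing(splitIndex, iterator):
    offset = sum(sizes[:splitIndex]) if splitIndex else 0
    for i, e in enumerate(iterator):
        yield (offset + i, e)

indexed = rdd.mapPartitionsWithIndex(indexing)
print(indexed.take(3))  # [(0, 0), (1, 1), (2, 2)]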
On Wed, Aug 6, 2014 at 6:22 PM, Mohit Singh mohit1...@gmail.com wrote:
My naive set up..
Adding
import os, sys
os.environ['SPARK_HOME'] = "/path/to/spark"
sys.path.append("/path/to/spark/python")
on top of my script.
from pyspark import SparkContext
from pyspark import SparkConf
Execution works from within PyCharm...
Though my next step is to figure out autocompletion and I bet there
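A slightly fuller version of that boilerplate, as a sketch (paths are placeholders and the py4j zip name varies by Spark release):

import os
import sys

spark_home = "/path/to/spark"  # placeholder
os.environ["SPARK_HOME"] = spark_home
sys.path.append(os.path.join(spark_home, "python"))
# py4j ships inside Spark's python/lib; match the zip to your release.
sys.path.append(os.path.join(spark_home, "python", "lib", "py4j-0.8.2.1-src.zip"))

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[2]").setAppName("pycharm-test")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())

Pointing the IDE interpreter's paths at the same two entries is usually what makes autocompletion work, too.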
One possible straightforward explanation might be that your solution(s) are
stuck in local minima?? And depending on your weight initialization, you
are getting different parameters?
Maybe use the same initial weights for both runs...
or
I would probably test the execution with synthetic
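As a concrete version of the same-initial-weights suggestion (the training calls are hypothetical; the point is only that both runs start identically):

import numpy as np

np.random.seed(42)                 # pin the RNG
init_weights = np.random.rand(10)  # identical starting point for both runs

# run_a = train(data, init_weights)  # hypothetical training routine
# run_b = train(data, init_weights)  # differences now come from the
#                                    # optimizer, not the initialization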
Hi,
We have set up Spark on an HPC system and are trying to put some
data pipelines and algorithms in place.
The input data is in hdf5 (these are very high-resolution brain images), and
it can be read via the h5py library in Python. So, my current approach (which
seems to be working) is writing
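One pattern that fits this shape, as a sketch (paths and the dataset name are placeholders; it assumes every node can see the files, as on a shared HPC filesystem):

import h5py
from pyspark import SparkContext

sc = SparkContext()
paths = ["/data/scan_001.h5", "/data/scan_002.h5"]  # placeholder paths

def load(path):
    # Opened on the worker; "image" is a hypothetical dataset name.
    with h5py.File(path, "r") as f:
        return f["image"][:]

images = sc.parallelize(paths, len(paths)).map(load)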
Hi,
I guess Spark uses streaming in the context of streaming live data, but
what I mean is something more along the lines of Hadoop streaming, where one
can code in any programming language.
Or is something along those lines on the cards?
Thanks
--
Mohit
When you want success as badly as you
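Worth noting: RDD.pipe is Spark's closest analog to Hadoop streaming. It forks an external command per partition and streams records through stdin/stdout, so the per-record logic can be written in any language. A minimal sketch:

from pyspark import SparkContext

sc = SparkContext()
nums = sc.parallelize(["1", "2", "3"])
# Each element goes to awk's stdin, one per line; whatever awk prints
# comes back as the elements of the new RDD.
doubled = nums.pipe("awk '{ print $1 * 2 }'")
print(doubled.collect())  # ['2', '4', '6']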