Hi,
I am going through this example here:
https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/kafka_wordcount.py
If I want to write this data to HDFS,
what's the right way to do it?
Thanks
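(A rough sketch of one option, not a confirmed answer from this thread: in the linked kafka_wordcount.py the word counts end up in a DStream, and a DStream can be written to HDFS batch by batch with saveAsTextFiles; the path below is a placeholder.)

counts.saveAsTextFiles("hdfs:///user/me/kafka_wordcounts")
# each micro-batch is written out as a directory named hdfs:///user/me/kafka_wordcounts-<timestamp>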
of jets3t
jar from hadoop/lib to spark/lib (which I believe is around version
0.9)?
On Wed, Sep 16, 2015 at 4:59 PM, Chengi Liu <chengi.liu...@gmail.com> wrote:
> Hi,
> I have a Spark cluster set up and I am trying to write the data to S3,
> but in Parquet format.
> Here
Hi,
I have a Spark cluster set up and I am trying to write the data to S3,
but in Parquet format.
Here is what I am doing
df = sqlContext.load('test', 'com.databricks.spark.avro')
df.saveAsParquetFile("s3n://test")
But I get some nasty error:
Py4JJavaError: An error occurred while calling
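(A hedged sketch, not the thread's confirmed fix: the truncated reply above points at the jets3t jar version, but independently of that, the s3n:// filesystem usually needs the AWS keys on the Hadoop configuration. The config key names below are the standard Hadoop ones; the bucket and prefix are placeholders.)

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "<ACCESS_KEY>")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "<SECRET_KEY>")
df = sqlContext.load("test", "com.databricks.spark.avro")
df.saveAsParquetFile("s3n://some-bucket/some-prefix")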
of EC2 with all the Python libraries installed, and
create a bash script to export PYTHONPATH in the /etc/init.d/ directory.
Then you can launch the cluster with this image and ec2.py.
Hope this can be helpful.
Cheers
Gen
On Sun, Feb 8, 2015 at 9:46 AM, Chengi Liu chengi.liu...@gmail.com
wrote
Hi,
I have written a few data structures as classes, like the following.
So, here is my code structure:
project/foo/foo.py, __init__.py
       /bar/bar.py, __init__.py        (bar.py imports foo as: from foo.foo import *)
       /execute/execute.py             (execute.py imports bar as: from bar.bar import *)
Ultimately I am
-- Forwarded message --
From: Chengi Liu chengi.liu...@gmail.com
Date: Tue, Oct 28, 2014 at 11:23 PM
Subject: Re: sampling in spark
To: Davies Liu dav...@databricks.com
Any suggestions.. Thanks
On Tue, Oct 28, 2014 at 12:53 AM, Chengi Liu chengi.liu...@gmail.com
wrote
Oops, the reference for the above code:
http://stackoverflow.com/questions/26583462/selecting-corresponding-k-rows-from-matrix-and-vector/26583945#26583945
On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu chengi.liu...@gmail.com
wrote:
Hi,
I have three RDDs: X, y and p.
X is a matrix RDD (m x n), y
Hi,
I have three RDDs: X, y and p.
X is a matrix RDD (m x n), y is an (m x 1) vector,
and p is an (m x 1) probability vector.
Now, I am trying to sample k rows from X and corresponding entries in y
based on probability vector p.
Here is the python implementation
import random
from bisect
) if i in index]
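(The snippet above is cut off; here is a small self-contained sketch of how a random/bisect weighted draw could look, assuming X_local, y_local and p_local are plain Python lists collected to the driver and k is the sample size. This is an illustration, not the exact code from the linked answer.)

import random
from bisect import bisect_right

cum = []
total = 0.0
for prob in p_local:               # cumulative distribution over the rows
    total += prob
    cum.append(total)

index = set()
while len(index) < k:              # draw k distinct row indices weighted by p
    index.add(bisect_right(cum, random.random() * total))

sampled_X = [row for i, row in enumerate(X_local) if i in index]
sampled_y = [val for i, val in enumerate(y_local) if i in index]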
On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu chengi.liu...@gmail.com
wrote:
Oops, the reference for the above code:
http://stackoverflow.com/questions/26583462/selecting-corresponding-k-rows-from-matrix-and-vector/26583945#26583945
On Tue, Oct 28, 2014 at 12:26 AM, Chengi
, reduceBy,
join, sortBy etc.
- If you don't have enough memory and the data is huge, then set the
storage level to MEMORY_AND_DISK_SER (a small sketch follows below).
You can read more here:
http://spark.apache.org/docs/1.0.0/tuning.html
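(For example, a minimal sketch of that persist call, assuming `rdd` is the RDD in question:)

from pyspark import StorageLevel
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)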
Thanks
Best Regards
On Sun, Oct 12, 2014 at 10:28 PM, Chengi Liu chengi.liu
/bd591dbe9e2836d9a72b87c3e63e30ffd908dfd6/Benchmark.scala#L30
Thanks
Best Regards
On Mon, Oct 13, 2014 at 12:36 PM, Chengi Liu chengi.liu...@gmail.com
wrote:
Hi Akhil,
Thanks for the response..
Another query... do you know how to use the spark.executor.extraJavaOptions
option?
SparkConf.set("spark.executor.extraJavaOptions"
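(Not from the original reply, but a small sketch of how that option is normally set from Python; the JVM flags here are illustrative only.)

conf = SparkConf()
conf.set("spark.executor.extraJavaOptions", "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
sc = SparkContext(conf=conf)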
Hi,
I posted a query yesterday and have tried out all the options given in
responses..
Basically, I am reading a very fat matrix (a 2000 by 50 matrix)
and am trying to run k-means on it.
I keep on getting a heap error.
Now, I am even using the persist(StorageLevel.DISK_ONLY_2) option.
Hi,
I am trying to use Spark but I am having a hard time configuring the
SparkConf...
My current conf is
conf = SparkConf().set("spark.executor.memory", "10g") \
                  .set("spark.akka.frameSize", "1") \
                  .set("spark.driver.memory", "16g")
but I still see the Java heap size error
14/10/12 09:54:50 ERROR Executor:
:35 AM, Chengi Liu chengi.liu...@gmail.com
wrote:
So.. same result with parallelize(matrix, 1000).
With broadcast.. seems like I got a JVM core dump :-/
14/09/15 02:31:22 INFO BlockManagerInfo: Registering block manager
host:47978
with 19.2 GB RAM
14/09/15 02:31:22 INFO BlockManagerInfo
)]
return sqrt(sum([x**2 for x in (point - center)]))
WSSSE = rdd.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
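(The fragment above looks like the standard MLlib k-means example; here is a self-contained sketch of it, assuming `rdd` holds the feature vectors as numpy arrays and that k=2 is just an illustrative cluster count.)

from math import sqrt
from pyspark.mllib.clustering import KMeans

clusters = KMeans.train(rdd, 2, maxIterations=10)

def error(point):
    # distance from each point to its assigned cluster center
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = rdd.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))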
Thanks
Best Regards
On Mon, Sep 15, 2014 at 9:16 AM, Chengi Liu chengi.liu...@gmail.com
wrote:
And the thing
at 2:18 PM, Chengi Liu chengi.liu...@gmail.com
wrote:
Hi Akhil,
So with your config (specifically with set("spark.akka.frameSize",
"1000")), I see the error:
org.apache.spark.SparkException: Job aborted due to stage failure:
Serialized task 0:0 was 401970046 bytes which exceeds
Hi,
I am trying to create an RDD out of a large matrix; sc.parallelize
suggests to use broadcast.
But when I do
sc.broadcast(data)
I get this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/common/usg/spark/1.0.2/python/pyspark/context.py", line 370,
in
values.
On Sun, Sep 14, 2014 at 2:20 AM, Chengi Liu chengi.liu...@gmail.com wrote:
Hi,
I am trying to create an RDD out of a large matrix; sc.parallelize
suggests to use broadcast.
But when I do
sc.broadcast(data)
I get this error:
Traceback (most recent call last):
  File "<stdin>", line
, Sep 14, 2014 at 2:54 PM, Chengi Liu chengi.liu...@gmail.com
wrote:
Specifically the error I see when I try to operate on rdd created by
sc.parallelize method
: org.apache.spark.SparkException: Job aborted due to stage failure:
Serialized task 12:12 was 12062263 bytes which exceeds
)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
Which is executed as following:
spark-submit --master $SPARKURL clustering_example.py --executor-memory
32G --driver-memory 60G
On Sun, Sep 14, 2014 at 10:47 AM, Chengi Liu chengi.liu...@gmail.com
wrote:
How
dav...@databricks.com wrote:
Hey Chengi,
What's the version of Spark you are using? There are big improvements
to broadcast in 1.1, could you try it?
On Sun, Sep 14, 2014 at 8:29 PM, Chengi Liu chengi.liu...@gmail.com
wrote:
Any suggestions.. I am really blocked on this one
On Sun, Sep
And the thing is, the code runs just fine if I reduce the number of rows in my
data?
On Sun, Sep 14, 2014 at 8:45 PM, Chengi Liu chengi.liu...@gmail.com wrote:
I am using Spark 1.0.2.
This is my work cluster.. so I can't set up a new version readily...
But right now, I am not using broadcast
Hi,
I have two files..
main_app.py and helper.py
main_app.py calls some functions in helper.py.
I want to use spark-submit to submit a job, but how do I specify helper.py?
Basically, how do I specify multiple files in Spark?
Thanks
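(Not from the original thread, but the usual way to ship extra Python files with a job is spark-submit's --py-files flag, e.g.:)

spark-submit --py-files helper.py main_app.py

(main_app.py can then simply do `import helper` on the workers as well.)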
(index, it, foo))
Also, you can make f a `closure`:
def f2(index, iterator):
    yield (index, foo)
rdd.mapPartitionsWithIndex(f2)
On Sun, Aug 17, 2014 at 10:25 AM, Chengi Liu chengi.liu...@gmail.com
wrote:
Hi,
In this example:
http://www.cs.berkeley.edu/~pwendell/strataconf/api
nevermind folks!!!
On Sat, Aug 16, 2014 at 2:22 PM, Chengi Liu chengi.liu...@gmail.com wrote:
Hi,
I have data like following:
1,2,3,4
1,2,3,4
5,6,2,1
and so on..
I would like to create a new rdd as follows:
(0,0,1)
(0,1,2)
(0,2,3)
(0,3,4)
(1,0,1)
.. and so on..
How do I do this?
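(A hedged sketch of one way to do it, assuming the input is an RDD of comma-separated text lines like the ones shown; the file path is a placeholder.)

lines = sc.textFile("input.txt")
triples = (lines.zipWithIndex()
                .flatMap(lambda line_i: [(line_i[1], j, int(v))
                                         for j, v in enumerate(line_i[0].split(","))]))
# e.g. (0, 0, 1), (0, 1, 2), (0, 2, 3), (0, 3, 4), (1, 0, 1), ...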
Hi,
I am doing some basic preprocessing in pyspark (local mode) as follows:
files = [...]  # input files
def read(filename, sc):
    # process file
    return rdd
if __name__ == "__main__":
    conf = SparkConf()
    conf.setMaster('local')
    sc = SparkContext(conf=conf)
    sc.setCheckpointDir(root + "temp/")
Bump
On Tuesday, August 5, 2014, Chengi Liu chengi.liu...@gmail.com wrote:
Hi,
I am doing some basic preprocessing in pyspark (local mode) as follows:
files = [...]  # input files
def read(filename, sc):
    # process file
    return rdd
if __name__ == "__main__":
    conf = SparkConf
Hi,
I have an RDD with n rows and m columns, but most of the entries are 0;
it's a sparse matrix.
I would like to get only the non-zero entries with their indices.
The equivalent Python code would be:
for i, x in enumerate(matrix):
    for j, y in enumerate(x):
        if y:
            print i, j, y
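(A hedged PySpark equivalent of that loop, assuming the RDD's elements are the rows, i.e. lists of numbers:)

nonzeros = (rdd.zipWithIndex()
               .flatMap(lambda row_i: [(row_i[1], j, v)
                                       for j, v in enumerate(row_i[0]) if v != 0]))
# yields (row_index, col_index, value) triples for the non-zero entries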
Hi,
Let's say I have millions of binary-format files, and let's say I have a
Java (or Python) library which reads and parses these binary
files.
Say
import foo
f = foo.open(filename)
header = f.get_header()
and some other methods..
What I was thinking was to write hadoop input
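(The message is cut off here, but one rough alternative to writing a Hadoop input format is to parallelize the list of file names and parse on the workers; `foo` is the hypothetical library from the question, the paths are placeholders, and the files must be reachable from every worker, e.g. on a shared filesystem.)

filenames = ["data/part-00000.bin", "data/part-00001.bin"]  # placeholder paths

def parse(filename):
    import foo                   # hypothetical parser library; must be installed on every worker
    f = foo.open(filename)
    return f.get_header()        # or whatever record structure is needed

headers = sc.parallelize(filenames).map(parse)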
Hi,
What is the easiest way to skip the first n lines of an RDD?
I am not able to figure this one out.
Thanks
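(One common answer, sketched under the assumption that "first n lines" means the first n elements in the RDD's order:)

def skip_first_n(rdd, n):
    return (rdd.zipWithIndex()
               .filter(lambda kv: kv[1] >= n)   # drop elements with index < n
               .map(lambda kv: kv[0]))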
Hi,
I have a very simple use case:
I have an RDD as follows:
d = [[1,2,3,4],[1,5,2,3],[2,3,4,5]]
Now, I want to remove all the duplicates based on a column and return the
remaining frame.
For example:
If I want to remove duplicates based on column 1,
then basically I would remove either row [1,2,3,4] or row [1,5,2,3].
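(A hedged sketch, assuming "column 1" means the first element of each row and that keeping an arbitrary one of the duplicates is acceptable:)

d = sc.parallelize([[1, 2, 3, 4], [1, 5, 2, 3], [2, 3, 4, 5]])
deduped = (d.map(lambda row: (row[0], row))   # key each row by its first column
            .reduceByKey(lambda a, b: a)      # keep one row per key
            .values())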
are not where they are intended to be, you're just seeing it
fail through all of them. I think it remains a connectivity problem
from your env to the repos, possibly because of a proxy?
--
Sean Owen | Director, Data Science | London
On Mon, Mar 17, 2014 at 8:39 PM, Chengi Liu chengi.liu...@gmail.com
it out by building in command
line..
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Tue, Mar 18, 2014 at 2:15 AM, Chengi Liu chengi.liu...@gmail.com wrote:
Hi Sean,
Yeah.. I am seeing errors across all repos, and yep
/index.html, there's a
script to do it.
On Mar 17, 2014, at 9:55 AM, Chengi Liu chengi.liu...@gmail.com wrote:
Hi,
I compiled the Spark examples and I see that there are a couple of jars
spark-examples_2.10-0.9.0-incubating-sources.jar
spark-examples_2.10-0.9.0-incubating.jar
If I want
.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Mon, Mar 17, 2014 at 1:25 PM, Chengi Liu chengi.liu...@gmail.com wrote:
Hi,
I am trying to compile the spark project using sbt/sbt assembly..
And I see this error:
[info
Hi,
A very noob question.. Here is my code in Eclipse:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object HelloWorld {
  def main(args: Array[String]) {
    println("Hello, world!")
    val sc = new SparkContext("localhost", "wordcount", args(0), Seq(args(1)))
rather than using pyspark shell..
On Wed, Feb 26, 2014 at 9:34 AM, Mayur Rustagi mayur.rust...@gmail.com wrote:
A bad solution is to run a mapper through the data and null the counts;
a good solution is to trim the header beforehand, without Spark.
On Feb 26, 2014 9:28 AM, Chengi Liu chengi.liu