Hi,
I am sorry for the beginner's question, but...
I have Spark Java code which reads a file (c:\my-input.csv), processes it,
and writes an output file (my-output.csv).
Now I want to run it on Hadoop in a distributed environment.
1) Should my input file be one big file, or separate smaller files?
2) if
Hi, I am a beginner too, but from what I have learned, Hadoop works better
with big files, at least 64 MB or 128 MB (typical HDFS block sizes) or even
more. I think you need to aggregate all the files into one new big file.
Then you must copy it to HDFS using this command:
hadoop fs -put MYFILE /YOUR_ROUTE_ON_HDFS/MYFILE
hadoop just
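Once the file is on HDFS, here is a minimal Scala sketch of reading it back,
processing it, and writing the result (the HDFS paths and the processing step
are placeholders, not your actual job):

    // assumes an existing SparkContext `sc`; paths below are placeholders
    val input = sc.textFile("hdfs:///YOUR_ROUTE_ON_HDFS/MYFILE")
    // stand-in for whatever processing your Java code does
    val output = input.map(line => line.toUpperCase)
    // note: this writes a directory of part files, not a single my-output.csv
    output.saveAsTextFile("hdfs:///YOUR_ROUTE_ON_HDFS/my-output")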
If you need quick response times, re-use your SparkContext between queries
and cache RDDs in memory.
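Roughly what I mean, as a sketch (the master URL, app name, and paths are
made up for illustration):

    // build the SparkContext once and keep it alive across queries
    val sc = new SparkContext("spark://master:7077", "shared-context")
    // load once and cache; later queries reuse the in-memory copy
    val events = sc.textFile("hdfs:///data/events").cache()
    val errors = events.filter(_.contains("ERROR")).count() // first action materializes the cache
    val warns = events.filter(_.contains("WARN")).count()   // subsequent queries hit the cache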
On Mar 3, 2014 12:42 AM, polkosity polkos...@gmail.com wrote:
Thanks for the advice Mayur.
I thought I'd report back on the performance difference... Spark standalone
mode has executors processing
Are you running in yarn-standalone mode or yarn-client mode? Also, which
YARN scheduler and what NodeManager heartbeat interval are you using?
polkosity, have you seen the job server that Ooyala open-sourced? I think
it's very similar to what you're proposing, with a REST API and re-use of a
SparkContext.
https://github.com/apache/incubator-spark/pull/222
http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server
+1
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Mon, Mar 3, 2014 at 4:10 PM, polkosity polkos...@gmail.com wrote:
That's exciting! Will be looking into that, thanks Andrew.
On a related topic, has anyone had any
Yes, Tachyon holds data in memory in serialized form, which is not as fast
as Spark's in-memory cache (deserialized). The difference really depends on
your job type.
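To make the distinction concrete, a small sketch of the two in-memory
storage levels in Spark (paths are placeholders):

    import org.apache.spark.storage.StorageLevel
    // deserialized in memory: fastest to access, largest footprint
    val fast = sc.textFile("hdfs:///data/a").persist(StorageLevel.MEMORY_ONLY)
    // serialized in memory: more compact, but every access pays a
    // deserialization cost, which is closer to what Tachyon gives you
    val compact = sc.textFile("hdfs:///data/b").persist(StorageLevel.MEMORY_ONLY_SER)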
Vector is an enhanced Array[Double]. You can compare it like Array[Double].
E.g.,
scala> val v1 = Vector(1.0, 2.0)
v1: org.apache.spark.util.Vector = (1.0, 2.0)

scala> val v2 = Vector(1.0, 2.0)
v2: org.apache.spark.util.Vector = (1.0, 2.0)

scala> val exactResult =
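If it helps, one way to finish that comparison element-wise; this assumes
the public `elements` field on org.apache.spark.util.Vector (as in Spark
0.9), and the variable name is just for illustration:

scala> val exactlyEqual = v1.elements.sameElements(v2.elements)
exactlyEqual: Boolean = true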
Where on the filesystem does Spark write the shuffle files?