Beginners Hadoop question

2014-03-03 Thread goi cto
Hi, I am sorry for the beginner's question, but... I have Spark Java code that reads a file (c:\my-input.csv), processes it, and writes an output file (my-output.csv). Now I want to run it on Hadoop in a distributed environment. 1) Should my input file be one big file or separate smaller files? 2) if

Re: Beginners Hadoop question

2014-03-03 Thread Alonso Isidoro Roman
Hi, I am a beginner too, but as I have learned, Hadoop works better with big files, at least 64 MB, 128 MB, or even more. I think you need to aggregate all the files into one new big file. Then you must copy it to HDFS using this command:
    hadoop fs -put MYFILE /YOUR_ROUTE_ON_HDFS/MYFILE
Hadoop just
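
The aggregation step suggested above could be sketched as follows, in plain Python with no Hadoop or Spark dependency. This is only an illustration under assumptions: the function name merge_files and the has_header parameter are hypothetical, and it assumes every small file is a text file that shares the same header line.

```python
from pathlib import Path


def _terminated(line):
    """Return the line with a trailing newline, adding one if it is missing."""
    return line if line.endswith("\n") else line + "\n"


def merge_files(parts, merged_path, has_header=True):
    """Concatenate small text files into one larger file.

    Assumption (not from the thread): when has_header is True, every part
    starts with the same header line, which is kept from the first part only.
    """
    merged_path = Path(merged_path)
    with merged_path.open("w", encoding="utf-8") as out:
        for index, part in enumerate(parts):
            with Path(part).open("r", encoding="utf-8") as src:
                if has_header:
                    header = src.readline()
                    # Keep the header once, from the first part only.
                    if index == 0 and header:
                        out.write(_terminated(header))
                for line in src:
                    # Guard against parts that lack a final newline, so rows
                    # from consecutive files never run together.
                    out.write(_terminated(line))
    return merged_path
```

The merged file can then be uploaded with the command quoted in the message, hadoop fs -put MYFILE /YOUR_ROUTE_ON_HDFS/MYFILE, where MYFILE is whatever path was passed as merged_path.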

Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Koert Kuipers
If you need a quick response, reuse your SparkContext between queries and cache RDDs in memory. On Mar 3, 2014 12:42 AM, polkosity polkos...@gmail.com wrote: Thanks for the advice Mayur. I thought I'd report back on the performance difference... Spark standalone mode has executors processing

Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Sandy Ryza
Are you running in yarn-standalone mode or yarn-client mode? Also, what YARN scheduler and what NodeManager heartbeat? On Sun, Mar 2, 2014 at 9:41 PM, polkosity polkos...@gmail.com wrote: Thanks for the advice Mayur. I thought I'd report back on the performance difference... Spark

Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Andrew Ash
polkosity, have you seen the job server that Ooyala open sourced? I think it's very similar to what you're proposing with a REST API and re-using a SparkContext. https://github.com/apache/incubator-spark/pull/222 http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server On Mon, Mar

Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Mayur Rustagi
+1 On Mon, Mar 3, 2014 at 4:10 PM, polkosity polkos...@gmail.com wrote: That's exciting! Will be looking into that, thanks Andrew. Related topic, has anyone had any

Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Koert Kuipers
Yes, Tachyon keeps data in memory in serialized form, which is not as fast as data cached in memory in Spark (not serialized). The difference really depends on your job type. On Mon, Mar 3, 2014 at 7:10 PM, polkosity polkos...@gmail.com wrote: That's exciting! Will be looking into that, thanks Andrew. Related

Re: o.a.s.u.Vector instances for equality

2014-03-03 Thread Shixiong Zhu
Vector is an enhanced Array[Double]. You can compare it like Array[Double]. E.g.,
scala> val v1 = Vector(1.0, 2.0)
v1: org.apache.spark.util.Vector = (1.0, 2.0)
scala> val v2 = Vector(1.0, 2.0)
v2: org.apache.spark.util.Vector = (1.0, 2.0)
scala> val exactResult =
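
The idea of comparing vectors the way one compares arrays of doubles can be illustrated with a small plain-Python sketch. It does not use Spark's Vector class or reproduce the truncated example above: lists of floats stand in for Array[Double], and the function names vectors_equal and vectors_close are hypothetical. The tolerance-based variant is an extra, included because exact equality on floating-point values is often stricter than intended.

```python
import math


def vectors_equal(a, b):
    """Exact element-wise equality of two sequences of floats."""
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))


def vectors_close(a, b, abs_tol=1e-9):
    """Element-wise equality within an absolute tolerance.

    abs_tol is an illustrative default, not a value taken from the thread.
    """
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=0.0, abs_tol=abs_tol) for x, y in zip(a, b)
    )
```

For instance, vectors_equal([1.0, 2.0], [1.0, 2.0]) holds, while a value computed as 0.1 + 0.2 matches 0.3 only under vectors_close.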

Shuffle Files

2014-03-03 Thread Usman Ghani
Where on the filesystem does Spark write the shuffle files?