Re: Using Spark

2014-06-22 Thread Ricky Thomas
Awesome, thanks. On Sunday, June 22, 2014, Matei Zaharia matei.zaha...@gmail.com wrote: Alright, added you. On Jun 20, 2014, at 2:52 PM, Ricky Thomas ri...@truedash.io wrote: Hi, we would like to add ourselves to the user list if possible

Re: Spark throws NoSuchFieldError when testing on cluster mode

2014-06-22 Thread Peng Cheng
Right, problem solved in a most disgraceful manner: just add a package relocation in the Maven shade config. The downside is that it is not compatible with my IDE (IntelliJ IDEA); it will cause: Error: scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found.
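[Editor's sketch: the relocation Peng mentions would look roughly like the following maven-shade-plugin fragment; the pattern and shadedPattern package names here are hypothetical placeholders for whatever package actually conflicts.]

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <configuration>
        <relocations>
          <!-- hypothetical: move the conflicting package under a shaded prefix -->
          <relocation>
            <pattern>com.example.conflicting</pattern>
            <shadedPattern>shaded.com.example.conflicting</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </plugin>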

InputStreamsSuite test failed

2014-06-22 Thread crazymb
Hello, I am new to Scala and Spark. Yesterday I compiled Spark from the 1.0.0 source code and ran the tests, and one test case failed. For example, run this command in a shell: sbt/sbt testOnly org.apache.spark.streaming.InputStreamsSuite. The test case test("socket input stream") would
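[Editor's aside: in batch mode, sbt generally needs the suite name quoted so that testOnly receives it as an argument rather than as a separate command:]

    sbt/sbt "testOnly org.apache.spark.streaming.InputStreamsSuite"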

Shark vs Impala

2014-06-22 Thread Flavio Pompermaier
Hi folks, I was looking at the benchmark provided by Cloudera at http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the-apache-hadoop-ecosystem-why-impala-continues-to-lead/ . Is it true that Shark cannot execute some queries if you don't have enough memory? And is it true/reliable that Impala

Re: Shark vs Impala

2014-06-22 Thread Bertrand Dechoux
For the second question, I would say it is mainly because the projects do not have the same aim. Impala does have a cost-based optimizer and predicate-propagation capability, which is natural because it interprets pseudo-SQL queries. In the realm of relational databases, it is often not a good idea

Re: Shark vs Impala

2014-06-22 Thread Toby Douglass
I've just benchmarked Spark and Impala: same data (in S3), same query, same cluster. Impala has a long load time, since it cannot load directly from S3; I have to create a Hive table on S3, then insert from that into an Impala table. This takes a long time; Spark took about 600s for the query,

Re: Shark vs Impala

2014-06-22 Thread Debasish Das
600s for Spark vs 5s for Redshift... the numbers look much different from the AMPLab benchmark: https://amplab.cs.berkeley.edu/benchmark/. Is it SSDs or something similar that's helping Redshift, or is the whole dataset in memory when you run the query? Could you publish the query? Also after

MLLib sample data format

2014-06-22 Thread Justin Yip
Hello, I am looking into a couple of MLlib data files in https://github.com/apache/spark/tree/master/data/mllib, but I cannot find any explanation for these files. Does anyone know if they are documented? Thanks. Justin

Re: MLLib sample data format

2014-06-22 Thread Justin Yip
Hi Shuo, Yes, I was reading the guide as well as the sample code. For example, http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-support-vector-machine-svm calls sc.textFile("mllib/data/ridge-data/lpsa.data"), but nowhere in the github repository can I find that file. Thanks. Justin

Re: MLLib sample data format

2014-06-22 Thread Evan Sparks
These files follow the LIBSVM format, where each line is a record: the first column is a label, and the remaining fields are offset:value pairs, where offset is the position in the feature vector and value is the value of that feature. This is a fairly efficient representation for sparse
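[Editor's illustration: two made-up records in this format, followed by a minimal loading sketch using MLlib's MLUtils; sc is the SparkContext from the shell, and the file path is assumed from the repository layout linked above.]

    1.0 3:0.5 7:1.2
    0.0 1:2.0 4:0.9

    import org.apache.spark.mllib.util.MLUtils
    // parses LIBSVM-formatted text into an RDD[LabeledPoint] with sparse feature vectors
    val examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    examples.take(1).foreach(println)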

hi

2014-06-22 Thread rapelly kartheek
Hi, can someone help me with the following error that I faced while setting up a single-node Spark framework?

    karthik@karthik-OptiPlex-9020:~/spark-1.0.0$ MASTER=spark://localhost:7077 sbin/spark-shell
    bash: sbin/spark-shell: No such file or directory
    karthik@karthik-OptiPlex-9020:~/spark-1.0.0$

Persistent Local Node variables

2014-06-22 Thread Daedalus
TL;DR: I want to run a pre-processing step on the data from each partition (such as parsing) and retain the parsed objects on each node for future processing calls, to avoid repeated parsing. More detail: I have a server and two nodes in my cluster, and data partitioned using HDFS. I am trying

Re: MLLib sample data format

2014-06-22 Thread Justin Yip
I see. That's good. Thanks. Justin. On Sun, Jun 22, 2014 at 4:59 PM, Evan Sparks evan.spa...@gmail.com wrote: Oh, and the MovieLens one is userid::movieid::rating. - Evan. On Jun 22, 2014, at 3:35 PM, Justin Yip yipjus...@gmail.com wrote: Hello, I am looking into a couple of MLlib data

Re: Persistent Local Node variables

2014-06-22 Thread Daedalus
Will using mapPartitions and creating a new RDD of ParsedData objects avoid multiple parsing?
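[Editor's sketch: broadly yes, provided the parsed RDD is cached, since an uncached RDD is recomputed on every action. A minimal sketch, assuming a hypothetical rawLines: RDD[String] and a hypothetical ParsedData case class:]

    case class ParsedData(fields: Array[String])
    // parse each partition once, then cache so later jobs reuse the parsed objects
    val parsed = rawLines.mapPartitions(lines => lines.map(l => ParsedData(l.split(",")))).cache()
    parsed.count()  // the first action materializes and caches the parsed RDD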

Re: hi

2014-06-22 Thread Akhil Das
Open your web UI in the browser, find the Spark URL in the top-left corner of the page, and use it when starting your spark shell instead of localhost:7077. Thanks, Best Regards. On Mon, Jun 23, 2014 at 10:56 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, can someone help me with
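[Editor's aside: the bash error above also points at the wrong path; in Spark 1.0.0 the shell launcher lives under bin/, not sbin/, so with the master URL taken from the web UI the invocation would be roughly:]

    MASTER=spark://<master-url-from-web-UI>:7077 bin/spark-shell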