Re: SparkSQL HiveContext No Suitable Driver / Cannot Find Driver

2014-08-30 Thread Denny Lee
Oh, I forgot to add the managed libraries and the Hive libraries to the CLASSPATH. As soon as I did that, we're good to go now. On August 29, 2014 at 22:55:47, Denny Lee (denny.g@gmail.com) wrote: My issue is similar to the issue as noted

spark-ec2 [Errno 110] Connection time out

2014-08-30 Thread David Matheson
I'm following the latest documentation on configuring a cluster on ec2 (http://spark.apache.org/docs/latest/ec2-scripts.html). Running ./spark-ec2 -k Blah -i .ssh/Blah.pem -s 2 launch spark-ec2-test gets a generic timeout error that's coming from File ./spark_ec2.py, line 717, in real_main

spark on yarn with hive

2014-08-30 Thread centerqi hu
I want to run Hive on Spark and YARN clusters; the Hive metastore is stored in MySQL. I compiled the Spark code with: sh make-distribution.sh --hadoop 2.4.1 --with-yarn --skip-java-test --tgz --with-hive My HQL code: import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql._ import
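
A minimal Scala sketch of the kind of HiveContext usage the poster describes (the query is a placeholder, and this assumes a Spark build compiled with Hive support and a hive-site.xml pointing at the MySQL-backed metastore on the classpath):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object HiveOnYarnExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("HiveOnYarnExample")
        val sc = new SparkContext(conf)
        val hiveContext = new HiveContext(sc)

        // hql() was the HiveQL entry point in Spark 1.0.x; later releases use sql().
        hiveContext.hql("SHOW TABLES").collect().foreach(println)

        sc.stop()
      }
    }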

Re: Low Level Kafka Consumer for Spark

2014-08-30 Thread Sean Owen
I'm no expert, but as I understand it, yes, you create multiple streams to consume multiple partitions in parallel. If they're all in the same Kafka consumer group, you'll get exactly one copy of each message, so yes, if you have 10 consumers and 3 Kafka partitions I believe only 3 will be getting
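
A minimal sketch of the pattern Sean describes: one receiver-based stream per Kafka partition, all in the same consumer group, unioned into a single DStream. The topic, ZooKeeper quorum, and group id below are hypothetical, and the spark-streaming-kafka artifact is assumed to be on the classpath:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("KafkaParallelConsume")
    val ssc = new StreamingContext(conf, Seconds(10))

    val numPartitions = 3  // number of Kafka partitions for the topic
    // One stream per partition; the shared consumer group means each message
    // is delivered to exactly one of the streams.
    val streams = (1 to numPartitions).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", Map("my-topic" -> 1))
    }
    val unified = ssc.union(streams)

    unified.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()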

Re: Spark Hive max key length is 767 bytes

2014-08-30 Thread arthur.hk.c...@gmail.com
Hi Michael, Thank you so much!! I have tried to change the following key lengths from 256 to 255 and from 767 to 766, but it still didn't work: alter table COLUMNS_V2 modify column COMMENT VARCHAR(255); alter table INDEX_PARAMS modify column PARAM_KEY VARCHAR(255); alter table SD_PARAMS modify column

Re: org.apache.spark.examples.xxx

2014-08-30 Thread Akhil Das
It bundles all these sources (https://github.com/apache/spark/tree/master/examples) together, and it also uses the pom file to get the dependency list, if I'm not mistaken. Thanks Best Regards On Fri, Aug 29, 2014 at 12:39 AM, filipus floe...@gmail.com wrote: hey guys i still try to get used to

Re: org.apache.spark.examples.xxx

2014-08-30 Thread Ted Yu
bq. how was the spark...example...jar file built? You can use the following command to build against Hadoop 2.4: mvn -Phadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests clean package The examples jar can be found under examples/target. Cheers On Sat, Aug 30, 2014 at 6:54 AM, Akhil Das

Re: org.apache.spark.examples.xxx

2014-08-30 Thread filipus
I'm trying to get used to sbt in order to build standalone applications by myself. I managed to run the SimpleApp example; then I tried to copy an example Scala program like LinearRegression into a local directory: . ./build.sbt ./src ./src/main ./src/main/scala ./src/main/scala/LinearRegression.scala
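
For reference, a minimal build.sbt for a standalone app of this shape might look as follows (the project name, Scala version, and Spark version are placeholders, not taken from the thread):

    name := "linear-regression-example"

    version := "1.0"

    scalaVersion := "2.10.4"

    // spark-mllib pulls in spark-core transitively
    libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.0.2"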

Mapping Hadoop Reduce to Spark

2014-08-30 Thread Steve Lewis
When programming in Hadoop it is possible to guarantee 1) All keys sent to a specific partition will be handled by the same machine (thread) 2) All keys received by a specific machine (thread) will be received in sorted order 3) These conditions will hold even if the values associated with a

Re: org.apache.spark.examples.xxx

2014-08-30 Thread filipus
Compilation works but execution does not, at least with spark-submit, as I described above. When I make a local copy of the training set, sbt run works: sbt run sample_linear_regression_data.txt. But when I do sbt run ~/git/spark/data/mllib/sample_linear_regression_data.txt the

Re: org.apache.spark.examples.xxx

2014-08-30 Thread Ted Yu
Did you run sbt under /home/filip/spark-ex-regression ? '~/git/spark/data/mllib/sample_linear_regression_data.txt' was interpreted as rooted under /home/filip/spark-ex-regression Cheers On Sat, Aug 30, 2014 at 9:28 AM, filipus floe...@gmail.com wrote: compilation works but execution not at

Re: org.apache.spark.examples.xxx

2014-08-30 Thread filipus
ok I see :-) .. instead of ~ works fine. So do you know the reason sbt run [options] works after sbt package, but spark-submit --class ClassName --master local[2] target/scala/JarPackage.jar [options] doesn't? It cannot resolve everything somehow

Re: Mapping Hadoop Reduce to Spark

2014-08-30 Thread Matei Zaharia
In 1.1, you'll be able to get all of these properties using sortByKey, and then mapPartitions on top to iterate through the key-value pairs. Unfortunately sortByKey does not let you control the Partitioner, but it's fairly easy to write your own version that does if this is important. In
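
A minimal sketch of the pattern Matei describes, sorting by key and then iterating over each partition's key-value pairs in order (the data is made up; the SparkContext._ import is needed for the pair-RDD operations in Spark 1.0/1.1):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    val sc = new SparkContext(new SparkConf().setAppName("SortedPartitions"))

    val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3), ("a", 4)))

    // sortByKey yields a range-partitioned RDD whose keys are sorted within each
    // partition (and partitions are ordered by key range), so mapPartitions sees
    // the pairs in sorted order, much like iteration inside a Hadoop reducer.
    val perPartition = pairs.sortByKey().mapPartitions { iter =>
      iter.map { case (k, v) => s"$k -> $v" }
    }
    perPartition.collect().foreach(println)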

Re: Spark Hive max key length is 767 bytes

2014-08-30 Thread Denny Lee
Oh, you may actually be running into an issue with your MySQL setup. Try running alter database metastore_db character set latin1 so that Hive (and the Spark HiveContext) can execute properly against the metastore. On August 29, 2014 at 04:39:01, arthur.hk.c...@gmail.com

Re: Spark Hive max key length is 767 bytes

2014-08-30 Thread arthur.hk.c...@gmail.com
Hi, already done, but I still get the same error (I use Hive 0.13.1, Spark 1.0.2, Hadoop 2.4.1). Steps: Step 1) mysql: alter database hive character set latin1; Step 2) Hive: hive create table test_datatype2 (testbigint bigint); OK Time taken: 0.708 seconds hive drop table test_datatype2;

Spark Master/Slave and HA

2014-08-30 Thread arthur.hk.c...@gmail.com
Hi, I have a few questions about the Spark Master and Slave setup. I have 5 Hadoop nodes (n1, n2, n3, n4, and n5); at the moment I run Spark on these nodes: n1: Hadoop Active Name node, Hadoop Slave, Spark Active Master

Spark and Shark Node: RAM Allocation

2014-08-30 Thread arthur.hk.c...@gmail.com
Hi, is there any formula to calculate proper RAM allocation values for Spark and Shark based on physical RAM and Hadoop and HBase RAM usage? E.g., if a node has 32GB of physical RAM: spark-defaults.conf: spark.executor.memory ?g spark-env.sh: export SPARK_WORKER_MEMORY=? export
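
The thread does not give a formula, but for reference, the executor memory setting mentioned above can also be set programmatically. A minimal sketch (the 8g value is an arbitrary placeholder, not a recommendation; the per-node cap still comes from SPARK_WORKER_MEMORY in spark-env.sh):

    import org.apache.spark.{SparkConf, SparkContext}

    // Equivalent to setting spark.executor.memory in spark-defaults.conf.
    val conf = new SparkConf()
      .setAppName("MemoryConfigExample")
      .set("spark.executor.memory", "8g")

    val sc = new SparkContext(conf)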

Fwd: What does appMasterRpcPort: -1 indicate ?

2014-08-30 Thread Tao Xiao
I'm using CDH 5.1.0, which bundles Spark 1.0.0 with it. Following How-to: Run a Simple Apache Spark App in CDH 5 http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/ , I tried to submit my job in local mode, Spark Standalone mode and YARN mode. I successfully

Re: Low Level Kafka Consumer for Spark

2014-08-30 Thread Roger Hoover
I have this same question. Isn't there somewhere that the Kafka range metadata can be saved? From my naive perspective, it seems like it should be very similar to HDFS lineage. The original HDFS blocks are kept somewhere (in the driver?) so that if an RDD partition is lost, it can be
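
For context, a minimal sketch of Spark Streaming metadata checkpointing, which is one place this kind of recovery information can live (the checkpoint path is hypothetical; whether this covers Kafka offset/range metadata is exactly what the thread is debating):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("CheckpointExample"), Seconds(10))

    // Checkpointing persists DStream metadata to a fault-tolerant store so the
    // driver can recover its streaming state after a failure.
    ssc.checkpoint("hdfs:///spark/checkpoints")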

Powered By Spark

2014-08-30 Thread Yi Tian
Hi, could you please add Asiainfo to the Powered By Spark page? Thanks. Asiainfo www.asiainfo.com Core, SQL, Streaming, MLlib, GraphX We leverage Spark and the Hadoop ecosystem to build cost-effective data center solutions for our customers in the telecom industry as well as other industrial sectors.

Re: saveAsSequenceFile for DStream

2014-08-30 Thread Chris Fregly
A couple of things to add here: 1) you can import the org.apache.spark.streaming.dstream.PairDStreamFunctions implicit, which adds a whole ton of functionality to DStream itself; this lets you work at the DStream level instead of digging into the underlying RDDs. 2) you can use ssc.fileStream(directory)
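
A minimal sketch of writing a pair DStream out as sequence files per batch via foreachRDD (the source and output path are placeholders; in Spark 1.0.x the sequence-file implicits come from SparkContext._ and the pair-DStream operations from StreamingContext._):

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext._          // pair-RDD / sequence-file implicits (Spark 1.0.x)
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._  // pair-DStream operations mentioned above

    val ssc = new StreamingContext(new SparkConf().setAppName("SeqFileOutput"), Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source
    val pairs = lines.map(line => (line, 1L))

    // Each batch is written as its own set of sequence files under a time-stamped directory.
    pairs.foreachRDD { (rdd, time) =>
      rdd.saveAsSequenceFile(s"hdfs:///output/seq-${time.milliseconds}")
    }

    ssc.start()
    ssc.awaitTermination()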

How can a deserialized Java object be stored on disk?

2014-08-30 Thread Tao Xiao
Reading about RDD Persistence (https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence), I learned that the storage level MEMORY_AND_DISK means Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk,
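
For reference, a minimal sketch of requesting that storage level (the dataset is made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("PersistExample"))

    val rdd = sc.parallelize(1 to 1000000)

    // Partitions that fit are kept as deserialized Java objects on the JVM heap;
    // partitions that do not fit are serialized when spilled to local disk and
    // deserialized again when read back.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count()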

Re: data locality

2014-08-30 Thread Chris Fregly
You can view the Locality Level of each task within a stage using the Spark Web UI, under the Stages tab. The levels are as follows (in order of decreasing desirability): 1) PROCESS_LOCAL - data was found directly in the executor JVM 2) NODE_LOCAL - data was found on the same node as the executor
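
Related, a minimal sketch of the locality-wait setting, which controls how long the scheduler waits for a slot at a better locality level before falling back to a worse one (the 3000 ms value is just the documented default, used here as a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.locality.wait (milliseconds in this Spark 1.x era): how long to wait
    // before stepping down the locality ladder, e.g. PROCESS_LOCAL -> NODE_LOCAL -> ANY.
    val conf = new SparkConf()
      .setAppName("LocalityWaitExample")
      .set("spark.locality.wait", "3000")

    val sc = new SparkContext(conf)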

Re: Low Level Kafka Consumer for Spark

2014-08-30 Thread Tim Smith
I'd be interested to understand this mechanism as well, but this is the error-recovery part of the equation. Consuming from Kafka has two aspects, parallelism and error recovery, and I am not sure how either works. For error recovery, I would like to understand how: - A failed receiver gets