Re: Connection pooling in spark jobs

2015-04-03 Thread Charles Feduke
Out of curiosity I wanted to see what JBoss supported in terms of
clustering and database connection pooling since its implementation should
suffice for your use case. I found:

*Note:* JBoss does not recommend using this feature on a production
environment. It requires accessing a connection pool remotely and this is
an anti-pattern as connections are not serializable. Besides, transaction
propagation is not supported and it could lead to connection leaks if the
remote clients are unreliable (i.e crashes, network failure). If you do
need to access a datasource remotely, JBoss recommends accessing it via a
remote session bean facade.[1]

You probably aren't worried about transactions; I gather from your use case
you are just pulling this data in a read only fashion. That being said
JBoss appears to have something.

The other thing to look for is whether or not a solution exists in Hadoop;
I can't find anything for JDBC connection pools over a cluster (just pools
local to a mapper which is similar to what Cody suggested earlier for Spark
and partitions).

If you were talking about a high volume web application then I'd believe
the extra effort for connection pooling [over the cluster] would be worth
it. Unless you're planning on executing several hundred parallel jobs, does
the small amount of overhead outweigh the time necessary to develop a
solution? (I'm guessing a solution doesn't exist because the pattern where
it would be an issue just isn't a common use case for Spark. I went down
this path - connection pooling - myself originally and found a single
connection per executor was fine for my needs. Local connection pools for
the partition as Cody said previously would also work for my use case.)
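For what it's worth, a rough sketch of the single-connection-per-executor approach
(the JDBC URL, credentials, table, and class names below are placeholders, and the
JDBC driver jar is assumed to be on the executor classpath):

import java.sql.{Connection, DriverManager}
import org.apache.spark.{SparkConf, SparkContext}

// A Scala object is initialized once per executor JVM, so the lazy val gives
// exactly one connection per executor, created on first use.
object ExecutorConnection {
  lazy val conn: Connection =
    DriverManager.getConnection("jdbc:postgresql://dbhost:5432/mydb", "user", "secret")
}

object SingleConnectionPerExecutor {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SingleConnectionPerExecutor"))
    sc.parallelize(1 to 1000).foreachPartition { ids =>
      val stmt = ExecutorConnection.conn.prepareStatement("SELECT name FROM things WHERE id = ?")
      ids.foreach { id =>
        stmt.setInt(1, id)
        val rs = stmt.executeQuery()
        while (rs.next()) { /* consume the row */ }
        rs.close()
      }
      stmt.close() // the connection itself stays open for the lifetime of the executor
    }
    sc.stop()
  }
}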

A local connection pool that was shared amongst all executors on a node
isn't a solution since different jobs execute under different JVMs even
when on the same worker node.[2]

1. https://developer.jboss.org/wiki/ConfigDataSources
2. http://spark.apache.org/docs/latest/cluster-overview.html



On Fri, Apr 3, 2015 at 1:39 AM Sateesh Kavuri sateesh.kav...@gmail.com
wrote:

 Each executor runs for about 5 secs, during which time the db connection can
 potentially be open. Each executor will have 1 connection open.
 Connection pooling surely has its advantages of performance and not
 hitting the db server for every open/close. The database in question is not
 just used by the spark jobs, but is shared by other systems, and so the
 spark jobs have to be better at managing the resources.

 I am not really looking for a db connections counter (I will let the db
 handle that part), but rather to have a pool of connections on the Spark end
 so that the connections can be reused across jobs


 On Fri, Apr 3, 2015 at 10:21 AM, Charles Feduke charles.fed...@gmail.com
 wrote:

 How long does each executor keep the connection open for? How many
 connections does each executor open?

 Are you certain that connection pooling is a performant and suitable
 solution? Are you running out of resources on the database server and
 cannot tolerate each executor having a single connection?

 If you need a solution that limits the number of open connections
 [resource starvation on the DB server] I think you'd have to fake it with a
 centralized counter of active connections, and logic within each executor
 that blocks when the counter is at a given threshold. If the counter is not
 at threshold, then an active connection can be created (after incrementing
 the shared counter). You could use something like ZooKeeper to store the
 counter value. This would have the overall effect of decreasing performance
 if your required number of connections outstrips the database's resources.

 On Fri, Apr 3, 2015 at 12:22 AM Sateesh Kavuri sateesh.kav...@gmail.com
 wrote:

 But this basically means that the pool is confined to the job (of a
 single app) in question, but is not sharable across multiple apps?
 The setup we have is a job server (the spark-jobserver) that creates
 jobs. Currently, we have each job opening and closing a connection to the
 database. What we would like to achieve is for each of the jobs to obtain a
 connection from a db pool

 Any directions on how this can be achieved?

 --
 Sateesh

 On Thu, Apr 2, 2015 at 7:00 PM, Cody Koeninger c...@koeninger.org
 wrote:

 Connection pools aren't serializable, so you generally need to set them
 up inside of a closure.  Doing that for every item is wasteful, so you
 typically want to use mapPartitions or foreachPartition

 rdd.mapPartitions { part =>
   setupPool
   part.map { ...



 See Design Patterns for using foreachRDD in
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams

 On Thu, Apr 2, 2015 at 7:52 AM, Sateesh Kavuri 
 sateesh.kav...@gmail.com wrote:

 Right, I am aware of how to use connection pooling with Oracle, but
 the specific question is how to use it in the context of Spark job
 execution
 On 2 Apr 2015 17:41, Ted Yu yuzhih...@gmail.com 

Re: Delaying failed task retries + giving failing tasks to different nodes

2015-04-03 Thread Akhil Das
I think these are the configurations you are looking for:

*spark.locality.wait*: Number of milliseconds to wait to launch a
data-local task before giving up and launching it on a less-local node. The
same wait will be used to step through multiple locality levels
(process-local, node-local, rack-local and then any). It is also possible
to customize the waiting time for each level by setting
spark.locality.wait.node, etc. You should increase this setting if your
tasks are long and see poor locality, but the default usually works well.

Read more here
https://spark.apache.org/docs/latest/configuration.html#scheduling
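For example, something along these lines in your SparkConf (the values here are
just placeholders to show where the settings go):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.locality.wait", "10000")       // ms to wait for a data-local slot before falling back
  .set("spark.locality.wait.node", "30000")  // optional per-level override
  .set("spark.task.maxFailures", "16")       // failures of any one task tolerated before the job fails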



Thanks
Best Regards

On Fri, Apr 3, 2015 at 8:41 AM, Stephen Merity step...@commoncrawl.org
wrote:

 Hi there,

 I've been using Spark for processing 33,000 gzipped files that contain
 billions of JSON records (the metadata [WAT] dataset from Common Crawl).
 I've hit a few issues and have not yet found the answers from the
 documentation / search. This may well just be me not finding the right
 pages though I promise I've attempted to RTFM thoroughly!

 Is there any way to (a) ensure retry attempts are done on different nodes
 and/or (b) ensure there's a delay between retrying a failing task (similar
 to spark.shuffle.io.retryWait)?

 Optimally when a task fails it should be given to different executors*.
 This is not the case that I've seen. With maxFailures set to 16, the task
 is handed back to the same executor 16 times, even though there are 89
 other nodes.

 The retry attempts are incredibly fast. The transient issue disappears
 quickly (DNS resolution fails to an Amazon bucket) but the 16 retry
 attempts take less than a second, all run on the same flaky node.

 For now I've set the maxFailures to an absurdly high number and that has
 worked around the issue -- the DNS error disappears on the specified
 machine after ~22 seconds (~360 task attempts) -- but that's obviously
 suboptimal.

 Additionally, are there other options for handling node failures? Other
 than maxFailures I've only seen things relating to shuffle failures? In the
 one instance I've had a node lose communication, it killed the job. I'd
 assumed the RDD would reconstruct. For now I've tried to work around it by
 persisting to multiple machines (MEMORY_AND_DISK_SER_2).

 Thanks! ^_^

 --
 Regards,
 Stephen Merity
 Data Scientist @ Common Crawl



Re: Mllib kmeans #iteration

2015-04-03 Thread amoners
Have you referred to the official documentation of k-means at
https://spark.apache.org/docs/1.1.1/mllib-clustering.html ?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-kmeans-iteration-tp22353p22365.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark Application Stages and DAG

2015-04-03 Thread Tathagata Das
What he meant is that you should look it up in the Spark UI, specifically in the
Stages tab, to see what is taking so long. And yes, a code snippet helps us debug.
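(A small aside, not something mentioned above: rdd.toDebugString prints the lineage
of an RDD, and the shuffle boundaries in that output correspond to the stage splits
you see in the UI. A tiny illustration from the shell, with a made-up input path:)

val words = sc.textFile("hdfs:///tmp/input.txt").flatMap(_.split(" "))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
println(counts.toDebugString)  // prints the chain of transformations behind each stage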

TD

On Fri, Apr 3, 2015 at 12:47 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 You need to open the page of the Stage which is taking time, and see how long it's
 spending on GC etc. It would also be good to post that Stage and its
 previous transformations' code snippets to help us understand it better.

 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 1:05 PM, Vijay Innamuri vijay.innam...@gmail.com
 wrote:


 When I run the Spark application (streaming) in local mode, I can see
 the execution progress as below:

 [Stage 0: =================>                  (1817 + 1) / 3125]
 [Stage 2: =======>                            (740 + 1) / 3125]

 One of the stages is taking a long time to execute.

 How do I find the transformations/actions associated with a particular
 stage?
 Is there any way to find the execution DAG of a Spark application?

 Regards
 Vijay





Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Akhil Das
How did you build Spark? Which version of Spark are you using? Doesn't
this thread already explain it?
https://www.mail-archive.com/user@spark.apache.org/msg25505.html
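For reference, a Hive-enabled build is usually along these lines (the exact
profiles depend on your Hadoop version, so adjust as needed):

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package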

Thanks
Best Regards

On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 Tried your suggestion to no avail.  I actually do not see any jackson or
 json serde jars in the $HIVE/lib directory.  This is hive 0.13.1 and
 spark 1.2.1

 Here is what I did:

 I have added the lib folder to the --jars option when starting the
 spark-shell,
 but the job fails. The hive-site.xml is in the $SPARK_HOME/conf directory.

 I start the spark-shell as follows:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 
 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

 and like this

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 
 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar 
 --jars /opt/hive/0.13.1/lib/*

 I’m just doing this in the spark-shell now:

 import org.apache.spark.sql.hive._
 val sqlContext = new HiveContext(sc)
 import sqlContext._
 case class MetricTable(path: String, pathElements: String, name: String, value: String)
 val mt = new MetricTable(path = "/DC1/HOST1/",
   pathElements = """[{"node": "DataCenter", "value": "DC1"}, {"node": "host", "value": "HOST1"}]""",
   name = "Memory Usage (%)",
   value = "29.590943279257175")
 val rdd1 = sc.makeRDD(List(mt))
 rdd1.printSchema()
 rdd1.registerTempTable("metric_table")
 sql("""
   SELECT path, name, value, v1.peValue, v1.peName
     FROM metric_table
       lateral view json_tuple(pathElements, 'name', 'value') v1
         as peName, peValue
   """).collect.foreach(println(_))

 It results in the same error:

 15/04/02 12:33:59 INFO ParseDriver: Parsing command: SELECT path, name, 
 value, v1.peValue, v1.peName FROM metric_table   lateral view 
 json_tuple(pathElements, 'name', 'value') v1 as peName, peValue
 15/04/02 12:34:00 INFO ParseDriver: Parse Completed
 res2: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[5] at RDD at SchemaRDD.scala:108
 == Query Plan ==
 == Physical Plan ==
 java.lang.ClassNotFoundException: json_tuple

 Any other suggestions or am I doing something else wrong here?

 -Todd



 On Thu, Apr 2, 2015 at 2:00 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Try adding all the jars in your $HIVE/lib directory. If you want the
 specific jar, you could look for jackson or json serde in it.

 Thanks
 Best Regards

 On Thu, Apr 2, 2015 at 12:49 AM, Todd Nist tsind...@gmail.com wrote:

 I have a feeling I’m missing a Jar that provides the support, or this
 may be related to https://issues.apache.org/jira/browse/SPARK-5792.
 If it is a Jar where would I find that ? I would have thought in the
 $HIVE/lib folder, but not sure which jar contains it.

 Error:

 Create Metric Temporary Table for querying
 15/04/01 14:41:44 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
 15/04/01 14:41:44 INFO ObjectStore: ObjectStore, initialize called
 15/04/01 14:41:45 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
 15/04/01 14:41:45 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
 15/04/01 14:41:45 INFO BlockManager: Removing broadcast 0
 15/04/01 14:41:45 INFO BlockManager: Removing block broadcast_0
 15/04/01 14:41:45 INFO MemoryStore: Block broadcast_0 of size 1272 dropped from memory (free 278018571)
 15/04/01 14:41:45 INFO BlockManager: Removing block broadcast_0_piece0
 15/04/01 14:41:45 INFO MemoryStore: Block broadcast_0_piece0 of size 869 dropped from memory (free 278019440)
 15/04/01 14:41:45 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.1.5:63230 in memory (size: 869.0 B, free: 265.1 MB)
 15/04/01 14:41:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
 15/04/01 14:41:45 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.1.5:63278 in memory (size: 869.0 B, free: 530.0 MB)
 15/04/01 14:41:45 INFO ContextCleaner: Cleaned broadcast 0
 15/04/01 14:41:46 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
 15/04/01 14:41:46 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table.
 15/04/01 14:41:46 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table.
 15/04/01 14:41:47 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table.
 15/04/01 14:41:47 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its

Re: Spark Streaming Worker runs out of inodes

2015-04-03 Thread Akhil Das
Did you try these?

- Disable shuffle : spark.shuffle.spill=false
- Enable log rotation:

sparkConf.set("spark.executor.logs.rolling.strategy", "size")
    .set("spark.executor.logs.rolling.size.maxBytes", "1024")
    .set("spark.executor.logs.rolling.maxRetainedFiles", "3")
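If it is the standalone worker's application directories that keep growing, the
worker-side cleanup settings also have to be passed as -D options through
SPARK_WORKER_OPTS in spark-env.sh, roughly like this (interval/TTL values are
just examples):

export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800 -Dspark.worker.cleanup.appDataTtl=604800"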


Thanks
Best Regards

On Fri, Apr 3, 2015 at 9:09 AM, a mesar amesa...@gmail.com wrote:

 Yes, with spark.cleaner.ttl set there is no cleanup.  We pass 
 --properties-file
 spark-dev.conf to spark-submit where  spark-dev.conf contains:

 spark.master spark://10.250.241.66:7077
 spark.logConf true
 spark.cleaner.ttl 1800
 spark.executor.memory 10709m
 spark.cores.max 4
 spark.shuffle.consolidateFiles true

 On Thu, Apr 2, 2015 at 7:12 PM, Tathagata Das t...@databricks.com wrote:

 Are you saying that even with the spark.cleaner.ttl set your files are
 not getting cleaned up?

 TD

 On Thu, Apr 2, 2015 at 8:23 AM, andrem amesa...@gmail.com wrote:

 Apparently Spark Streaming 1.3.0 is not cleaning up its internal files
 and
 the worker nodes eventually run out of inodes.
 We see tons of old shuffle_*.data and *.index files that are never
 deleted.
 How do we get Spark to remove these files?

 We have a simple standalone app with one RabbitMQ receiver and a two node
 cluster (2 x r3large AWS instances).
 Batch interval is 10 minutes after which we process data and write
 results
 to DB. No windowing or state mgmt is used.

 I've poured over the documentation and tried setting the following
 properties but they have not helped.
 As a work around we're using a cron script that periodically cleans up
 old
 files but this has a bad smell to it.

 SPARK_WORKER_OPTS in spark-env.sh on every worker node
   spark.worker.cleanup.enabled true
   spark.worker.cleanup.interval
   spark.worker.cleanup.appDataTtl

 Also tried on the driver side:
   spark.cleaner.ttl
   spark.shuffle.consolidateFiles true



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Worker-runs-out-of-inodes-tp22355.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.







Re: Spark Streaming Worker runs out of inodes

2015-04-03 Thread Charles Feduke
You could also try setting your `nofile` value in /etc/security/limits.conf
for `soft` to some ridiculously high value if you haven't done so already.
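For example, something like this in limits.conf (both soft and hard, then log in
again for it to take effect):

*    soft    nofile    1000000
*    hard    nofile    1000000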

On Fri, Apr 3, 2015 at 2:09 AM Akhil Das ak...@sigmoidanalytics.com wrote:

 Did you try these?

 - Disable shuffle : spark.shuffle.spill=false
 - Enable log rotation:

 sparkConf.set("spark.executor.logs.rolling.strategy", "size")
     .set("spark.executor.logs.rolling.size.maxBytes", "1024")
     .set("spark.executor.logs.rolling.maxRetainedFiles", "3")


 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 9:09 AM, a mesar amesa...@gmail.com wrote:

 Yes, with spark.cleaner.ttl set there is no cleanup.  We pass 
 --properties-file
 spark-dev.conf to spark-submit where  spark-dev.conf contains:

 spark.master spark://10.250.241.66:7077
 spark.logConf true
 spark.cleaner.ttl 1800
 spark.executor.memory 10709m
 spark.cores.max 4
 spark.shuffle.consolidateFiles true

 On Thu, Apr 2, 2015 at 7:12 PM, Tathagata Das t...@databricks.com
 wrote:

 Are you saying that even with the spark.cleaner.ttl set your files are
 not getting cleaned up?

 TD

 On Thu, Apr 2, 2015 at 8:23 AM, andrem amesa...@gmail.com wrote:

 Apparently Spark Streaming 1.3.0 is not cleaning up its internal files
 and
 the worker nodes eventually run out of inodes.
 We see tons of old shuffle_*.data and *.index files that are never
 deleted.
 How do we get Spark to remove these files?

 We have a simple standalone app with one RabbitMQ receiver and a two
 node
 cluster (2 x r3large AWS instances).
 Batch interval is 10 minutes after which we process data and write
 results
 to DB. No windowing or state mgmt is used.

 I've poured over the documentation and tried setting the following
 properties but they have not helped.
 As a work around we're using a cron script that periodically cleans up
 old
 files but this has a bad smell to it.

 SPARK_WORKER_OPTS in spark-env.sh on every worker node
   spark.worker.cleanup.enabled true
   spark.worker.cleanup.interval
   spark.worker.cleanup.appDataTtl

 Also tried on the driver side:
   spark.cleaner.ttl
   spark.shuffle.consolidateFiles true



 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/Spark-Streaming-Worker-runs-out-of-
 inodes-tp22355.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.








About Waiting batches on the spark streaming UI

2015-04-03 Thread bit1...@163.com

I copied the following from the Spark Streaming UI. I don't know why the
Waiting batches is 1; my understanding is that it should be 72.
Following is my understanding:
1. Total time is 1 minute 35 seconds = 95 seconds.
2. Batch interval is 1 second, so 95 batches are generated in 95 seconds.
3. Processed batches are 23 (correct, because my processing code does
nothing but sleep 4 seconds).
4. So the waiting batches should be 95 - 23 = 72.


Started at: Fri Apr 03 15:17:47 CST 2015 
Time since start: 1 minute 35 seconds 
Network receivers: 1 
Batch interval: 1 second 
Processed batches: 23 
Waiting batches: 1 
Received records: 0 
Processed records: 0   



bit1...@163.com


Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame

2015-04-03 Thread Jaonary Rabarisoa
Good! Thank you.

On Thu, Apr 2, 2015 at 9:05 AM, Xiangrui Meng men...@gmail.com wrote:

 I reproduced the bug on master and submitted a patch for it:
 https://github.com/apache/spark/pull/5329. It may get into Spark
 1.3.1. Thanks for reporting the bug! -Xiangrui

 On Wed, Apr 1, 2015 at 12:57 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:
  Hmm, I got the same error with the master. Here is another test example
 that
  fails. Here, I explicitly create
  a Row RDD which corresponds to the use case I am in :
 
  object TestDataFrame {

    def main(args: Array[String]): Unit = {

      val conf = new SparkConf().setAppName("TestDataFrame").setMaster("local[4]")
      val sc = new SparkContext(conf)
      val sqlContext = new SQLContext(sc)

      import sqlContext.implicits._

      val data = Seq(LabeledPoint(1, Vectors.zeros(10)))
      val dataDF = sc.parallelize(data).toDF

      dataDF.printSchema()
      dataDF.save("test1.parquet") // OK

      val dataRow = data.map { case LabeledPoint(l: Double, f: mllib.linalg.Vector) =>
        Row(l, f)
      }

      val dataRowRDD = sc.parallelize(dataRow)
      val dataDF2 = sqlContext.createDataFrame(dataRowRDD, dataDF.schema)

      dataDF2.printSchema()

      dataDF2.saveAsParquetFile("test3.parquet") // FAIL !!!
    }
  }
 
 
  On Tue, Mar 31, 2015 at 11:18 PM, Xiangrui Meng men...@gmail.com
 wrote:
 
  I cannot reproduce this error on master, but I'm not aware of any
  recent bug fixes that are related. Could you build and try the current
  master? -Xiangrui
 
  On Tue, Mar 31, 2015 at 4:10 AM, Jaonary Rabarisoa jaon...@gmail.com
  wrote:
   Hi all,
  
   A DataFrame with a user defined type (here mllib.Vector) created with
   sqlContext.createDataFrame can't be saved to a parquet file and raises a
   "ClassCastException: org.apache.spark.mllib.linalg.DenseVector cannot be
   cast to org.apache.spark.sql.Row" error.
  
   Here is an example of code to reproduce this error :
  
   object TestDataFrame {

     def main(args: Array[String]): Unit = {
       //System.loadLibrary(Core.NATIVE_LIBRARY_NAME)
       val conf = new SparkConf().setAppName("RankingEval").setMaster("local[8]")
         .set("spark.executor.memory", "6g")

       val sc = new SparkContext(conf)
       val sqlContext = new SQLContext(sc)

       import sqlContext.implicits._

       val data = sc.parallelize(Seq(LabeledPoint(1, Vectors.zeros(10))))
       val dataDF = data.toDF

       dataDF.save("test1.parquet")

       val dataDF2 = sqlContext.createDataFrame(dataDF.rdd, dataDF.schema)

       dataDF2.save("test2.parquet")
     }
   }
  
  
   Is this related to https://issues.apache.org/jira/browse/SPARK-5532
 and
   how
   can it be solved ?
  
  
   Cheers,
  
  
   Jao
 
 



Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Todd Nist
I placed it there.  It was downloaded from MySql site.

On Fri, Apr 3, 2015 at 6:25 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 Akhil
 you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar .
 how come you got this lib into spark/lib folder.
 1) did you place it there ?
 2) What is download location ?


 On Fri, Apr 3, 2015 at 3:42 PM, Todd Nist tsind...@gmail.com wrote:

 Started the spark shell with the one jar from hive suggested:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 
 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar 
 --jars /opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar

 Results in the same error:

 scala sql( | SELECT path, name, value, v1.peValue, v1.peName 
 |  FROM metric_table |lateral view 
 json_tuple(pathElements, 'name', 'value') v1 |  as peName, 
 peValue | )
 15/04/03 06:01:30 INFO ParseDriver: Parsing command: SELECT path, name, 
 value, v1.peValue, v1.peName FROM metric_table   lateral 
 view json_tuple(pathElements, 'name', 'value') v1 as peName, 
 peValue
 15/04/03 06:01:31 INFO ParseDriver: Parse Completed
 res2: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[5] at RDD at SchemaRDD.scala:108== Query Plan  Physical Plan ==
 java.lang.ClassNotFoundException: json_tuple

 I will try the rebuild.  Thanks again for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 5:34 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Can you try building Spark
 https://spark.apache.org/docs/1.2.0/building-spark.html#building-with-hive-and-jdbc-support%23building-with-hive-and-jdbc-support
 with hive support? Before that try to run the following:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-
 cores 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.
 1.34-bin.jar --jars /opt/hive/0.13.1/lib/hive-exec.jar

 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 This is for version 1.2.1.  Well the other thread that you reference
 was me attempting it in 1.3.0 to see if the issue was related to 1.2.1.  I
 did not build Spark but used the version from the Spark download site for
 1.2.1 Pre Built for Hadoop 2.4 or Later.

 Since I get the error in both 1.2.1 and 1.3.0,

 15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in
 thread main java.lang.ClassNotFoundException: json_tuple at
 java.net.URLClassLoader$1.run(

 It looks like I just don't have the jar.  Even including all jars in
 the $HIVE/lib directory did not seem to work.  Though when looking in
 $HIVE/lib for 0.13.1, I do not see any json serde or jackson files.  I do
 see that hive-exec.jar contains
 the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
 you know if there is another Jar that is required or should it work just by
 including all jars from $HIVE/lib?

 I can build it locally, but did not think that was required based on
 the version I downloaded; is that not the case?

 Thanks for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 How did you build spark? which version of spark are you having?
 Doesn't this thread already explains it?
 https://www.mail-archive.com/user@spark.apache.org/msg25505.html

 Thanks
 Best Regards

 On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 Tried your suggestion to no avail.  I actually to not see and
 jackson or json serde jars in the $HIVE/lib directory.  This is hive
 0.13.1 and spark 1.2.1

 Here is what I did:

 I have added the lib folder to the –jars option when starting the
 spark-shell,
 but the job fails. The hive-site.xml is in the $SPARK_HOME/conf
 directory.

 I start the spark-shell as follows:

 ./bin/spark-shell --master spark://radtech.io:7077 
 --total-executor-cores 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

 and like this

 ./bin/spark-shell --master spark://radtech.io:7077 
 --total-executor-cores 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
 /opt/hive/0.13.1/lib/*

 I’m just doing this in the spark-shell now:

 import org.apache.spark.sql.hive._val sqlContext = new 
 HiveContext(sc)import sqlContext._case class MetricTable(path: String, 
 pathElements: String, name: String, value: String)val mt = new 
 MetricTable(path: /DC1/HOST1/,
 pathElements: [{node: DataCenter,value: DC1},{node: 
 host,value: HOST1}],
 name: Memory Usage (%),
 value: 29.590943279257175)val rdd1 = sc.makeRDD(List(mt))
 rdd1.printSchema()
 rdd1.registerTempTable(metric_table)
 sql(
 SELECT path, name, value, v1.peValue, v1.peName
  FROM metric_table
lateral view json_tuple(pathElements, 'name', 'value') v1
  as peName, peValue
 )
 .collect.foreach(println(_))

 It results in the same 

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Todd Nist
Hi Deepujain,

I did include the jar file, I believe it is hive-exe.jar, through the
--jars option:

./bin/spark-shell --master spark://radtech.io:7077
--total-executor-cores 2 --driver-class-path
/usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars
/opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar

Results in the same error.  I'm going to do the rebuild in a few minutes.

Thanks for the assistance.

-Todd



On Fri, Apr 3, 2015 at 6:30 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 I think you need to include the jar file through --jars option that
 contains the hive definition (code) of UDF json_tuple. That should solve
 your problem.

 On Fri, Apr 3, 2015 at 3:57 PM, Todd Nist tsind...@gmail.com wrote:

 I placed it there.  It was downloaded from MySql site.

 On Fri, Apr 3, 2015 at 6:25 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 Akhil
 you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar
 . how come you got this lib into spark/lib folder.
 1) did you place it there ?
 2) What is download location ?


 On Fri, Apr 3, 2015 at 3:42 PM, Todd Nist tsind...@gmail.com wrote:

 Started the spark shell with the one jar from hive suggested:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 
 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
 /opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar

 Results in the same error:

 scala sql( | SELECT path, name, value, v1.peValue, v1.peName   
   |  FROM metric_table |lateral view 
 json_tuple(pathElements, 'name', 'value') v1 |  as peName, 
 peValue | )
 15/04/03 06:01:30 INFO ParseDriver: Parsing command: SELECT path, name, 
 value, v1.peValue, v1.peName FROM metric_table   lateral 
 view json_tuple(pathElements, 'name', 'value') v1 as peName, 
 peValue
 15/04/03 06:01:31 INFO ParseDriver: Parse Completed
 res2: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[5] at RDD at SchemaRDD.scala:108== Query Plan  Physical Plan 
 ==
 java.lang.ClassNotFoundException: json_tuple

 I will try the rebuild.  Thanks again for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 5:34 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Can you try building Spark
 https://spark.apache.org/docs/1.2.0/building-spark.html#building-with-hive-and-jdbc-support%23building-with-hive-and-jdbc-support
 with hive support? Before that try to run the following:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-
 cores 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-
 5.1.34-bin.jar --jars /opt/hive/0.13.1/lib/hive-exec.jar

 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 This is for version 1.2.1.  Well the other thread that you reference
 was me attempting it in 1.3.0 to see if the issue was related to 1.2.1.  
 I
 did not build Spark but used the version from the Spark download site for
 1.2.1 Pre Built for Hadoop 2.4 or Later.

 Since I get the error in both 1.2.1 and 1.3.0,

 15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in
 thread main java.lang.ClassNotFoundException: json_tuple at
 java.net.URLClassLoader$1.run(

 It looks like I just don't have the jar.  Even including all jars in
 the $HIVE/lib directory did not seem to work.  Though when looking in
 $HIVE/lib for 0.13.1, I do not see any json serde or jackson files.  I do
 see that hive-exec.jar contains
 the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
 you know if there is another Jar that is required or should it work just 
 by
 including all jars from $HIVE/lib?

 I can build it locally, but did not think that was required based on
 the version I downloaded; is that not the case?

 Thanks for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das ak...@sigmoidanalytics.com
  wrote:

 How did you build spark? which version of spark are you having?
 Doesn't this thread already explains it?
 https://www.mail-archive.com/user@spark.apache.org/msg25505.html

 Thanks
 Best Regards

 On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist tsind...@gmail.com
 wrote:

 Hi Akhil,

 Tried your suggestion to no avail.  I actually to not see and
 jackson or json serde jars in the $HIVE/lib directory.  This is 
 hive
 0.13.1 and spark 1.2.1

 Here is what I did:

 I have added the lib folder to the –jars option when starting the
 spark-shell,
 but the job fails. The hive-site.xml is in the $SPARK_HOME/conf
 directory.

 I start the spark-shell as follows:

 ./bin/spark-shell --master spark://radtech.io:7077 
 --total-executor-cores 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

 and like this

 ./bin/spark-shell --master spark://radtech.io:7077 
 --total-executor-cores 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
 /opt/hive/0.13.1/lib/*

 I’m just doing 

Re: 答复:maven compile error

2015-04-03 Thread Ted Yu
Can you include -X in your maven command and pastebin the output?
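For example, redirecting to a file you can then upload:

mvn -X -DskipTests clean package > build.log 2>&1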

Cheers



 On Apr 3, 2015, at 3:58 AM, myelinji myeli...@aliyun.com wrote:
 
 Thank you for your reply. When I'm using maven to compile the whole project,
 the errors are as follows:
 
 [INFO] Spark Project Parent POM .. SUCCESS [4.136s]
 [INFO] Spark Project Networking .. SUCCESS [7.405s]
 [INFO] Spark Project Shuffle Streaming Service ... SUCCESS [5.071s]
 [INFO] Spark Project Core  SUCCESS [3:08.445s]
 [INFO] Spark Project Bagel ... SUCCESS [21.613s]
 [INFO] Spark Project GraphX .. SUCCESS [58.915s]
 [INFO] Spark Project Streaming ... SUCCESS [1:26.202s]
 [INFO] Spark Project Catalyst  FAILURE [1.537s]
 [INFO] Spark Project SQL . SKIPPED
 [INFO] Spark Project ML Library .. SKIPPED
 [INFO] Spark Project Tools ... SKIPPED
 [INFO] Spark Project Hive  SKIPPED
 [INFO] Spark Project REPL  SKIPPED
 [INFO] Spark Project Assembly  SKIPPED
 [INFO] Spark Project External Twitter  SKIPPED
 [INFO] Spark Project External Flume Sink . SKIPPED
 [INFO] Spark Project External Flume .. SKIPPED
 [INFO] Spark Project External MQTT ... SKIPPED
 [INFO] Spark Project External ZeroMQ . SKIPPED
 [INFO] Spark Project External Kafka .. SKIPPED
 [INFO] Spark Project Examples  SKIPPED
 
 It seems like there is something wrong with the catalyst project. Why can't I
 compile this project?
 
 
 --
  From: Sean Owen so...@cloudera.com
  Sent: Friday, April 3, 2015, 17:48
  To: myelinji myeli...@aliyun.com
  Cc: Spark user list user@spark.apache.org
  Subject: Re: maven compile error
 
 If you're asking about a compile error, you should include the command
 you used to compile.
 
 I am able to compile branch 1.2 successfully with mvn -DskipTests
 clean package.
 
 This error is actually an error from scalac, not a compile error from
 the code. It sort of sounds like it has not been able to download
 scala dependencies. Check or maybe recreate your environment.
 
 On Fri, Apr 3, 2015 at 3:19 AM, myelinji myeli...@aliyun.com wrote:
  Hi all:
  Just now I checked out spark-1.2 on GitHub and wanted to build it using Maven;
  however, I encountered an error during compilation:
 
  [INFO]
  
  [ERROR] Failed to execute goal
  net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on
  project spark-catalyst_2.10: wrap:
  scala.reflect.internal.MissingRequirementError: object scala.runtime in
  compiler mirror not found. - [Help 1]
  org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
  goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile
  (scala-compile-first) on project spark-catalyst_2.10: wrap:
  scala.reflect.internal.MissingRequirementError: object scala.runtime in
  compiler mirror not found.
  at
  org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
  at
  org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
  at
  org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
  at
  org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
  at
  org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
  at
  org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
  at
  org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
  at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
  at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
  at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
  at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
  at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
  sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at
  sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at
  org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
  at
  org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
  at
  org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
  at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
  Caused by: 

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread ๏̯͡๏
I think you need to include, through the --jars option, the jar file that
contains the hive definition (code) of the UDF json_tuple. That should solve
your problem.

On Fri, Apr 3, 2015 at 3:57 PM, Todd Nist tsind...@gmail.com wrote:

 I placed it there.  It was downloaded from MySql site.

 On Fri, Apr 3, 2015 at 6:25 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 Akhil
 you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar .
 how come you got this lib into spark/lib folder.
 1) did you place it there ?
 2) What is download location ?


 On Fri, Apr 3, 2015 at 3:42 PM, Todd Nist tsind...@gmail.com wrote:

 Started the spark shell with the one jar from hive suggested:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 
 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
 /opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar

 Results in the same error:

 scala sql( | SELECT path, name, value, v1.peValue, v1.peName
  |  FROM metric_table |lateral view 
 json_tuple(pathElements, 'name', 'value') v1 |  as peName, 
 peValue | )
 15/04/03 06:01:30 INFO ParseDriver: Parsing command: SELECT path, name, 
 value, v1.peValue, v1.peName FROM metric_table   lateral 
 view json_tuple(pathElements, 'name', 'value') v1 as peName, 
 peValue
 15/04/03 06:01:31 INFO ParseDriver: Parse Completed
 res2: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[5] at RDD at SchemaRDD.scala:108== Query Plan  Physical Plan 
 ==
 java.lang.ClassNotFoundException: json_tuple

 I will try the rebuild.  Thanks again for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 5:34 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Can you try building Spark
 https://spark.apache.org/docs/1.2.0/building-spark.html#building-with-hive-and-jdbc-support%23building-with-hive-and-jdbc-support
 with hive support? Before that try to run the following:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-
 cores 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5
 .1.34-bin.jar --jars /opt/hive/0.13.1/lib/hive-exec.jar

 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 This is for version 1.2.1.  Well the other thread that you reference
 was me attempting it in 1.3.0 to see if the issue was related to 1.2.1.  I
 did not build Spark but used the version from the Spark download site for
 1.2.1 Pre Built for Hadoop 2.4 or Later.

 Since I get the error in both 1.2.1 and 1.3.0,

 15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in
 thread main java.lang.ClassNotFoundException: json_tuple at
 java.net.URLClassLoader$1.run(

 It looks like I just don't have the jar.  Even including all jars in
 the $HIVE/lib directory did not seem to work.  Though when looking in
 $HIVE/lib for 0.13.1, I do not see any json serde or jackson files.  I do
 see that hive-exec.jar contains
 the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
 you know if there is another Jar that is required or should it work just 
 by
 including all jars from $HIVE/lib?

 I can build it locally, but did not think that was required based on
 the version I downloaded; is that not the case?

 Thanks for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 How did you build spark? which version of spark are you having?
 Doesn't this thread already explains it?
 https://www.mail-archive.com/user@spark.apache.org/msg25505.html

 Thanks
 Best Regards

 On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist tsind...@gmail.com
 wrote:

 Hi Akhil,

 Tried your suggestion to no avail.  I actually to not see and
 jackson or json serde jars in the $HIVE/lib directory.  This is hive
 0.13.1 and spark 1.2.1

 Here is what I did:

 I have added the lib folder to the –jars option when starting the
 spark-shell,
 but the job fails. The hive-site.xml is in the $SPARK_HOME/conf
 directory.

 I start the spark-shell as follows:

 ./bin/spark-shell --master spark://radtech.io:7077 
 --total-executor-cores 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

 and like this

 ./bin/spark-shell --master spark://radtech.io:7077 
 --total-executor-cores 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
 /opt/hive/0.13.1/lib/*

 I’m just doing this in the spark-shell now:

 import org.apache.spark.sql.hive._val sqlContext = new 
 HiveContext(sc)import sqlContext._case class MetricTable(path: String, 
 pathElements: String, name: String, value: String)val mt = new 
 MetricTable(path: /DC1/HOST1/,
 pathElements: [{node: DataCenter,value: DC1},{node: 
 host,value: HOST1}],
 name: Memory Usage (%),
 value: 29.590943279257175)val rdd1 = sc.makeRDD(List(mt))
 rdd1.printSchema()
 rdd1.registerTempTable(metric_table)
 sql(
 SELECT path, 

Which OS for Spark cluster nodes?

2015-04-03 Thread Horsmann, Tobias
Hi,
Are there any recommendations for operating systems that one should use for
setting up Spark/Hadoop nodes in general?
I am not familiar with the differences between the various Linux distributions
or how well they are (not) suited for cluster set-ups, so I wondered if there
are some preferred choices?

Regards,



Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Todd Nist
Started the spark shell with the one jar from hive suggested:

./bin/spark-shell --master spark://radtech.io:7077
--total-executor-cores 2 --driver-class-path
/usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars
/opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar

Results in the same error:

scala> sql("""
     | SELECT path, name, value, v1.peValue, v1.peName
     |   FROM metric_table
     |     lateral view json_tuple(pathElements, 'name', 'value') v1
     |       as peName, peValue
     | """)
15/04/03 06:01:30 INFO ParseDriver: Parsing command: SELECT path, name, value, v1.peValue, v1.peName FROM metric_table   lateral view json_tuple(pathElements, 'name', 'value') v1 as peName, peValue
15/04/03 06:01:31 INFO ParseDriver: Parse Completed
res2: org.apache.spark.sql.SchemaRDD =
SchemaRDD[5] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
java.lang.ClassNotFoundException: json_tuple

I will try the rebuild.  Thanks again for the assistance.

-Todd


On Fri, Apr 3, 2015 at 5:34 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Can you try building Spark
 https://spark.apache.org/docs/1.2.0/building-spark.html#building-with-hive-and-jdbc-support%23building-with-hive-and-jdbc-support
 with hive support? Before that try to run the following:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores
 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin
 .jar --jars /opt/hive/0.13.1/lib/hive-exec.jar

 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 This is for version 1.2.1.  Well the other thread that you reference was
 me attempting it in 1.3.0 to see if the issue was related to 1.2.1.  I did
 not build Spark but used the version from the Spark download site for 1.2.1
 Pre Built for Hadoop 2.4 or Later.

 Since I get the error in both 1.2.1 and 1.3.0,

 15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in thread
 main java.lang.ClassNotFoundException: json_tuple at
 java.net.URLClassLoader$1.run(

 It looks like I just don't have the jar.  Even including all jars in the
 $HIVE/lib directory did not seem to work.  Though when looking in $HIVE/lib
 for 0.13.1, I do not see any json serde or jackson files.  I do see that
 hive-exec.jar contains
 the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
 you know if there is another Jar that is required or should it work just by
 including all jars from $HIVE/lib?

 I can build it locally, but did not think that was required based on the
 version I downloaded; is that not the case?

 Thanks for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 How did you build spark? which version of spark are you having? Doesn't
 this thread already explains it?
 https://www.mail-archive.com/user@spark.apache.org/msg25505.html

 Thanks
 Best Regards

 On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 Tried your suggestion to no avail.  I actually to not see and jackson
 or json serde jars in the $HIVE/lib directory.  This is hive 0.13.1 and
 spark 1.2.1

 Here is what I did:

 I have added the lib folder to the –jars option when starting the
 spark-shell,
 but the job fails. The hive-site.xml is in the $SPARK_HOME/conf
 directory.

 I start the spark-shell as follows:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 
 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

 and like this

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 
 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
 /opt/hive/0.13.1/lib/*

 I’m just doing this in the spark-shell now:

 import org.apache.spark.sql.hive._val sqlContext = new 
 HiveContext(sc)import sqlContext._case class MetricTable(path: String, 
 pathElements: String, name: String, value: String)val mt = new 
 MetricTable(path: /DC1/HOST1/,
 pathElements: [{node: DataCenter,value: DC1},{node: 
 host,value: HOST1}],
 name: Memory Usage (%),
 value: 29.590943279257175)val rdd1 = sc.makeRDD(List(mt))
 rdd1.printSchema()
 rdd1.registerTempTable(metric_table)
 sql(
 SELECT path, name, value, v1.peValue, v1.peName
  FROM metric_table
lateral view json_tuple(pathElements, 'name', 'value') v1
  as peName, peValue
 )
 .collect.foreach(println(_))

 It results in the same error:

 15/04/02 12:33:59 INFO ParseDriver: Parsing command: SELECT path, name, 
 value, v1.peValue, v1.peName FROM metric_table   lateral 
 view json_tuple(pathElements, 'name', 'value') v1 as peName, 
 peValue
 15/04/02 12:34:00 INFO ParseDriver: Parse Completed
 res2: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[5] at RDD at SchemaRDD.scala:108== Query Plan  Physical Plan 
 ==
 java.lang.ClassNotFoundException: json_tuple

 

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Akhil Das
Copy pasted his command in the same thread.

Thanks
Best Regards

On Fri, Apr 3, 2015 at 3:55 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 Akhil
 you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar .
 how come you got this lib into spark/lib folder.
 1) did you place it there ?
 2) What is download location ?


 On Fri, Apr 3, 2015 at 3:42 PM, Todd Nist tsind...@gmail.com wrote:

 Started the spark shell with the one jar from hive suggested:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 
 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar 
 --jars /opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar

 Results in the same error:

 scala sql( | SELECT path, name, value, v1.peValue, v1.peName 
 |  FROM metric_table |lateral view 
 json_tuple(pathElements, 'name', 'value') v1 |  as peName, 
 peValue | )
 15/04/03 06:01:30 INFO ParseDriver: Parsing command: SELECT path, name, 
 value, v1.peValue, v1.peName FROM metric_table   lateral 
 view json_tuple(pathElements, 'name', 'value') v1 as peName, 
 peValue
 15/04/03 06:01:31 INFO ParseDriver: Parse Completed
 res2: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[5] at RDD at SchemaRDD.scala:108== Query Plan  Physical Plan ==
 java.lang.ClassNotFoundException: json_tuple

 I will try the rebuild.  Thanks again for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 5:34 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Can you try building Spark
 https://spark.apache.org/docs/1.2.0/building-spark.html#building-with-hive-and-jdbc-support%23building-with-hive-and-jdbc-support
 with hive support? Before that try to run the following:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-
 cores 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.
 1.34-bin.jar --jars /opt/hive/0.13.1/lib/hive-exec.jar

 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 This is for version 1.2.1.  Well the other thread that you reference
 was me attempting it in 1.3.0 to see if the issue was related to 1.2.1.  I
 did not build Spark but used the version from the Spark download site for
 1.2.1 Pre Built for Hadoop 2.4 or Later.

 Since I get the error in both 1.2.1 and 1.3.0,

 15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in
 thread main java.lang.ClassNotFoundException: json_tuple at
 java.net.URLClassLoader$1.run(

 It looks like I just don't have the jar.  Even including all jars in
 the $HIVE/lib directory did not seem to work.  Though when looking in
 $HIVE/lib for 0.13.1, I do not see any json serde or jackson files.  I do
 see that hive-exec.jar contains
 the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
 you know if there is another Jar that is required or should it work just by
 including all jars from $HIVE/lib?

 I can build it locally, but did not think that was required based on
 the version I downloaded; is that not the case?

 Thanks for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 How did you build spark? which version of spark are you having?
 Doesn't this thread already explains it?
 https://www.mail-archive.com/user@spark.apache.org/msg25505.html

 Thanks
 Best Regards

 On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 Tried your suggestion to no avail.  I actually to not see and
 jackson or json serde jars in the $HIVE/lib directory.  This is hive
 0.13.1 and spark 1.2.1

 Here is what I did:

 I have added the lib folder to the –jars option when starting the
 spark-shell,
 but the job fails. The hive-site.xml is in the $SPARK_HOME/conf
 directory.

 I start the spark-shell as follows:

 ./bin/spark-shell --master spark://radtech.io:7077 
 --total-executor-cores 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

 and like this

 ./bin/spark-shell --master spark://radtech.io:7077 
 --total-executor-cores 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
 /opt/hive/0.13.1/lib/*

 I’m just doing this in the spark-shell now:

 import org.apache.spark.sql.hive._val sqlContext = new 
 HiveContext(sc)import sqlContext._case class MetricTable(path: String, 
 pathElements: String, name: String, value: String)val mt = new 
 MetricTable(path: /DC1/HOST1/,
 pathElements: [{node: DataCenter,value: DC1},{node: 
 host,value: HOST1}],
 name: Memory Usage (%),
 value: 29.590943279257175)val rdd1 = sc.makeRDD(List(mt))
 rdd1.printSchema()
 rdd1.registerTempTable(metric_table)
 sql(
 SELECT path, name, value, v1.peValue, v1.peName
  FROM metric_table
lateral view json_tuple(pathElements, 'name', 'value') v1
  as peName, peValue
 )
 .collect.foreach(println(_))

 It results in the 

Re: Which OS for Spark cluster nodes?

2015-04-03 Thread Akhil Das
There isn't any specific Linux distro, but I would prefer Ubuntu for a
beginner as it's very easy to apt-get install things on it.

Thanks
Best Regards

On Fri, Apr 3, 2015 at 4:58 PM, Horsmann, Tobias tobias.horsm...@uni-due.de
 wrote:

  Hi,
 Are there any recommendations for operating systems that one should use
 for setting up Spark/Hadoop nodes in general?
 I am not familiar with the differences between the various linux
 distributions or how well they are (not) suited for cluster set-ups, so I
 wondered if there is some preferred choices?

  Regards,




Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Akhil Das
This thread might give you some insights
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3CCA+WVT8WXbEHac=N0GWxj-s9gqOkgG0VRL5B=ovjwexqm8ev...@mail.gmail.com%3E
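One common way of dealing with non-serializable Avro records (a sketch, and only
one possible direction, assuming Kryo is acceptable for your job) is to switch the
serializer to Kryo and register the generated class:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))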

Thanks
Best Regards

On Fri, Apr 3, 2015 at 3:53 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 My Spark Job failed with


 15/04/03 03:15:36 INFO scheduler.DAGScheduler: Job 0 failed:
 saveAsNewAPIHadoopFile at AbstractInputHelper.scala:103, took 2.480175 s
 15/04/03 03:15:36 ERROR yarn.ApplicationMaster: User class threw
 exception: Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 0)
 had a not serializable result:
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
 - object not serializable (class:
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value:
 {userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt:
 null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
 - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
 - object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0,
 spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt:
 null, currPsLvlId: null}))
 org.apache.spark.SparkException: Job aborted due to stage failure: Task
 0.0 in stage 2.0 (TID 0) had a not serializable result:
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
 - object not serializable (class:
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value:
 {userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt:
 null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
 - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
 - object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0,
 spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt:
 null, currPsLvlId: null}))
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 15/04/03 03:15:36 INFO yarn.ApplicationMaster: Final app status: FAILED,
 exitCode: 15, (reason: User class threw exception: Job aborted due to stage
 failure: Task 0.0 in stage 2.0 (TID 0) had a not serializable result:
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:

 

 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum is auto
 generated from an avro schema using the avro generate-sources maven plugin.


 package com.ebay.ep.poc.spark.reporting.process.model.dw;

 @SuppressWarnings("all")

 @org.apache.avro.specific.AvroGenerated

 public class SpsLevelMetricSum extends
 org.apache.avro.specific.SpecificRecordBase implements
 org.apache.avro.specific.SpecificRecord {
 ...
 ...
 }

 Can anyone suggest how to fix this ?



 --
 Deepak




Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
What version of Cassandra are you using?  Are you using DSE or the stock
Apache Cassandra version?  I have connected it with DSE, but have not
attempted it with the standard Apache Cassandra version.

FWIW,
http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
 provides an ODBC driver for accessing C* from Tableau.  Granted it does not
provide all the goodness of Spark.  Are you attempting to leverage the
spark-cassandra-connector for this?



On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller moham...@glassbeam.com
wrote:

  Hi –



 Is anybody using Tableau to analyze data in Cassandra through the Spark
 SQL Thrift Server?



 Thanks!



 Mohammed





Re: How to get a top X percent of a distribution represented as RDD

2015-04-03 Thread Aung Htet
Hi Debasish, Charles,

I solved the problem by using a BPQ like method, based on your suggestions.
So thanks very much for that!

My approach was
1) Count the population of each segment in the RDD by map/reduce so that I
get the bound number N equivalent to 10% of each segment. This becomes the
size of the BPQ.
2) Associate the bounds N to the corresponding records in the first RDD.
3) Reduce the RDD from step 2 by merging the values in every two rows,
basically creating a sorted list (Indexed Seq)
4) If the size of the sorted list is greater than N (the bound) then,
create a new sorted list by using a priority queue and dequeuing top N
values.

In the end, I get a record for each segment with the N max values for that
segment (a rough sketch of this approach is below).
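
(A rough Scala sketch of the approach, with hypothetical names -- it assumes
an RDD[(String, Double)] of (segment, score) pairs and keeps roughly the top
10% of scores per segment; it is not the exact code used in this thread.)

import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._   // pair-RDD functions on Spark < 1.3

def topFractionPerSegment(scores: RDD[(String, Double)],
                          fraction: Double = 0.10): RDD[(String, List[Double])] = {
  // 1) population of each segment -> bound N (the BPQ size)
  val bounds = scores.mapValues(_ => 1L).reduceByKey(_ + _)
    .mapValues(count => math.max(1, (count * fraction).toInt))

  // 2)-4) associate N with each record and reduce per segment, never
  // keeping more than N scores in any partial result
  scores.join(bounds)                                        // (segment, (score, n))
    .map { case (seg, (score, n)) => (seg, (n, List(score))) }
    .reduceByKey { case ((n, xs), (_, ys)) =>
      val pq = mutable.PriorityQueue[Double]((xs ++ ys): _*) // max-heap
      (n, List.fill(math.min(n, pq.size))(pq.dequeue()))     // top N, descending
    }
    .mapValues(_._2)
}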

Regards,
Aung








On Fri, Mar 27, 2015 at 4:27 PM, Debasish Das debasish.da...@gmail.com
wrote:

 In that case you can directly use count-min-sketch from algebirdthey
 work fine with Spark aggregateBy but I have found the java BPQ from Spark
 much faster than say algebird Heap datastructure...

 On Thu, Mar 26, 2015 at 10:04 PM, Charles Hayden 
 charles.hay...@atigeo.com wrote:

  ​You could also consider using a count-min data structure such as in
 https://github.com/laserson/dsq​

 to get approximate quantiles, then use whatever values you want to filter
 the original sequence.
  --
 *From:* Debasish Das debasish.da...@gmail.com
 *Sent:* Thursday, March 26, 2015 9:45 PM
 *To:* Aung Htet
 *Cc:* user
 *Subject:* Re: How to get a top X percent of a distribution represented
 as RDD

  Idea is to use a heap and get topK elements from every partition...then
 use aggregateBy and for combOp do a merge routine from
 mergeSort...basically get 100 items from partition 1, 100 items from
 partition 2, merge them so that you get sorted 200 items and take 100...for
 merge you can use heap as well...Matei had a BPQ inside Spark which we use
 all the time...Passing arrays over wire is better than passing full heap
 objects and merge routine on array should run faster but needs experiment...
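
 (A minimal sketch of that idea, with hypothetical names -- it assumes a plain
 RDD[Double] of scores, builds a bounded heap per partition, ships sorted
 arrays rather than heap objects, and merges them mergesort-style. Spark's own
 BoundedPriorityQueue is private, so a scala PriorityQueue stands in for it.)

 import scala.collection.mutable
 import org.apache.spark.rdd.RDD

 def topK(scores: RDD[Double], k: Int): Array[Double] = {
   val perPartition = scores.mapPartitions { iter =>
     // min-heap of size <= k: the smallest of the current top-k sits on top
     val heap = new mutable.PriorityQueue[Double]()(Ordering[Double].reverse)
     iter.foreach { x =>
       heap.enqueue(x)
       if (heap.size > k) heap.dequeue()
     }
     // pass a plain descending array over the wire instead of the heap object
     Iterator.single(heap.toArray.sorted(Ordering[Double].reverse))
   }

   // combOp: the merge step of mergesort over two descending arrays, keeping k
   def merge(a: Array[Double], b: Array[Double]): Array[Double] = {
     val out = new Array[Double](math.min(k, a.length + b.length))
     var i = 0; var j = 0; var m = 0
     while (m < out.length) {
       if (j >= b.length || (i < a.length && a(i) >= b(j))) { out(m) = a(i); i += 1 }
       else { out(m) = b(j); j += 1 }
       m += 1
     }
     out
   }

   perPartition.reduce(merge)
 }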

 On Thu, Mar 26, 2015 at 9:26 PM, Aung Htet aung@gmail.com wrote:

 Hi Debasish,

 Thanks for your suggestions. In-memory version is quite useful. I do not
 quite understand how you can use aggregateBy to get 10% top K elements. Can
 you please give an example?

 Thanks,
 Aung

 On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 You can do it in-memory as wellget 10% topK elements from each
 partition and use merge from any sort algorithm like timsortbasically
 aggregateBy

  Your version uses shuffle but this version is 0 shuffle..assuming
 your data set is cached you will be using in-memory allReduce through
 treeAggregate...

  But this is only good for top 10% or bottom 10%...if you need to do
 it for top 30% then may be the shuffle version will work better...

 On Thu, Mar 26, 2015 at 8:31 PM, Aung Htet aung@gmail.com wrote:

 Hi all,

  I have a distribution represented as an RDD of tuples, in rows of
 (segment, score)
 For each segment, I want to discard tuples with top X percent scores.
 This seems hard to do in Spark RDD.

  A naive algorithm would be -

 1) Sort RDD by segment & score (descending)
 2) Within each segment, number the rows from top to bottom.
 3) For each  segment, calculate the cut off index. i.e. 90 for 10% cut
 off out of a segment with 100 rows.
 4) For the entire RDD, filter rows with row num <= cut off index

 This does not look like a good algorithm. I would really appreciate if
 someone can suggest a better way to implement this in Spark.

  Regards,
 Aung








Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Dean Wampler
A hack workaround is to use flatMap:

rdd.flatMap { case (date, array) => for (x <- array) yield (date, x) }

For those of you who don't know Scala, the for comprehension iterates
through the ArrayBuffer, named array, and yields new tuples with the date
and each element. The case expression to the left of the => pattern matches
on the input tuples.
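
For a concrete (hypothetical) run in the spark-shell, after pulling the rows
down to (date, sequence) pairs:

val rdd = sc.parallelize(Seq(("2015-04", Seq("A", "B", "C", "D"))))

rdd.flatMap { case (date, array) => for (x <- array) yield (date, x) }
   .collect()
   .foreach(println)
// (2015-04,A)
// (2015-04,B)
// (2015-04,C)
// (2015-04,D)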

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Thu, Apr 2, 2015 at 10:45 PM, Denny Lee denny.g@gmail.com wrote:

 Thanks Michael - that was it!  I was drawing a blank on this one for some
 reason - much appreciated!


 On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust mich...@databricks.com
 wrote:

 A lateral view explode using HiveQL.  I'm hoping to add explode
 shorthand directly to the df API in 1.4.
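
 (For reference, a sketch of that HiveQL issued through a HiveContext -- the
 table and column names here are hypothetical:)

 // assumes a registered table "months" with columns
 // month: String and items: Array[String]
 val exploded = sqlContext.sql(
   """SELECT month, item
     |FROM months
     |LATERAL VIEW explode(items) itemTable AS item""".stripMargin)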

 On Thu, Apr 2, 2015 at 7:10 PM, Denny Lee denny.g@gmail.com wrote:

 Quick question - the output of a dataframe is in the format of:

 [2015-04, ArrayBuffer(A, B, C, D)]

 and I'd like to return it as:

 2015-04, A
 2015-04, B
 2015-04, C
 2015-04, D

 What's the best way to do this?

 Thanks in advance!






Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Deepak Jain
I was able to write a record that extends SpecificRecord (Avro); that class was
not auto-generated. Do we need to do something extra for auto-generated classes?

Sent from my iPhone

 On 03-Apr-2015, at 5:06 pm, Akhil Das ak...@sigmoidanalytics.com wrote:
 
 This thread might give you some insights 
 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3CCA+WVT8WXbEHac=N0GWxj-s9gqOkgG0VRL5B=ovjwexqm8ev...@mail.gmail.com%3E
 
 Thanks
 Best Regards
 
 On Fri, Apr 3, 2015 at 3:53 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
 My Spark Job failed with
 
 
 15/04/03 03:15:36 INFO scheduler.DAGScheduler: Job 0 failed: 
 saveAsNewAPIHadoopFile at AbstractInputHelper.scala:103, took 2.480175 s
 15/04/03 03:15:36 ERROR yarn.ApplicationMaster: User class threw exception: 
 Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 0) had a not 
 serializable result: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
  - object not serializable (class: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value: 
 {userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt: 
 null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
  - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
  - object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0, 
 spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt: 
 null, currPsLvlId: null}))
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 
 in stage 2.0 (TID 0) had a not serializable result: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
  - object not serializable (class: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value: 
 {userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt: 
 null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
  - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
  - object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0, 
 spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt: 
 null, currPsLvlId: null}))
  at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
  at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
  at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
  at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
  at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
  at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
  at scala.Option.foreach(Option.scala:236)
  at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
  at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
  at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 15/04/03 03:15:36 INFO yarn.ApplicationMaster: Final app status: FAILED, 
 exitCode: 15, (reason: User class threw exception: Job aborted due to stage 
 failure: Task 0.0 in stage 2.0 (TID 0) had a not serializable result: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
 
 
 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum is auto 
 generated through avro schema using avro-generate-sources maven plugin.
 
 
 package com.ebay.ep.poc.spark.reporting.process.model.dw;  
 
 @SuppressWarnings("all")
 
 @org.apache.avro.specific.AvroGenerated
 
 public class SpsLevelMetricSum extends 
 org.apache.avro.specific.SpecificRecordBase implements 
 org.apache.avro.specific.SpecificRecord {
 
 ...
 ...
 }
 
 Can anyone suggest how to fix this ?
 
 
 
 -- 
 Deepak
 


Re: Spark 1.3 UDF ClassNotFoundException

2015-04-03 Thread Markus Ganter
My apologies. I was running this locally and the JAR I was building
using IntelliJ had some issues.
This was not related to UDFs. All works fine now.

On Thu, Apr 2, 2015 at 2:58 PM, Ted Yu yuzhih...@gmail.com wrote:
 Can you show more code in CreateMasterData ?

 How do you run your code ?

 Thanks

 On Thu, Apr 2, 2015 at 11:06 AM, ganterm gant...@gmail.com wrote:

 Hello,

 I started to use the dataframe API in Spark 1.3 with Scala.
 I am trying to implement a UDF and am following the sample here:

 https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.sql.UserDefinedFunction
 meaning
 val predict = udf((score: Double) => if (score > 0.5) true else false)
 df.select( predict(df("score")) )
 All compiles just fine but when I run it, I get a ClassNotFoundException
 (see more details below)
 I am sure that I load the data correctly and that I have a field called
 score with the correct data type.
 Do I need to do anything else like registering the function?

 Thanks!
 Markus

 Exception in thread main org.apache.spark.SparkException: Job aborted
 due
 to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure:
 Lost task 0.3 in stage 6.0 (TID 11, BillSmithPC):
 java.lang.ClassNotFoundException: test.CreateMasterData$$anonfun$1
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:270)
 at

 org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:65)
 ...




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-UDF-ClassNotFoundException-tp22361.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Parquet timestamp support for Hive?

2015-04-03 Thread Rex Xiong
Hi,

I got this error when creating a hive table from parquet file:
DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.UnsupportedOperationException: Parquet does not support
timestamp. See HIVE-6384

I check HIVE-6384, it's fixed in 0.14.
The hive in spark build is a customized version 0.13.1a
(GroupId: org.spark-project.hive), is it possible to get the source code
for it and apply patch from HIVE-6384?

Thanks


Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Deepak Jain
I meant that I did not have to use Kryo. Why will Kryo help fix this issue now?

Sent from my iPhone

 On 03-Apr-2015, at 5:36 pm, Deepak Jain deepuj...@gmail.com wrote:
 
 I was able to write record that extends specificrecord (avro) this class was 
 not auto generated. Do we need to do something extra for auto generated 
 classes 
 
 Sent from my iPhone
 
 On 03-Apr-2015, at 5:06 pm, Akhil Das ak...@sigmoidanalytics.com wrote:
 
 This thread might give you some insights 
 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3CCA+WVT8WXbEHac=N0GWxj-s9gqOkgG0VRL5B=ovjwexqm8ev...@mail.gmail.com%3E
 
 Thanks
 Best Regards
 
 On Fri, Apr 3, 2015 at 3:53 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
 My Spark Job failed with
 
 
 15/04/03 03:15:36 INFO scheduler.DAGScheduler: Job 0 failed: 
 saveAsNewAPIHadoopFile at AbstractInputHelper.scala:103, took 2.480175 s
 15/04/03 03:15:36 ERROR yarn.ApplicationMaster: User class threw exception: 
 Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 0) had a not 
 serializable result: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
 - object not serializable (class: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value: 
 {userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt: 
 null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
 - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
 - object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0, 
 spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt: 
 null, currPsLvlId: null}))
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 
 in stage 2.0 (TID 0) had a not serializable result: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
 - object not serializable (class: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value: 
 {userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt: 
 null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
 - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
 - object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0, 
 spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt: 
 null, currPsLvlId: null}))
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at scala.Option.foreach(Option.scala:236)
 at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 15/04/03 03:15:36 INFO yarn.ApplicationMaster: Final app status: FAILED, 
 exitCode: 15, (reason: User class threw exception: Job aborted due to stage 
 failure: Task 0.0 in stage 2.0 (TID 0) had a not serializable result: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
 
 
 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum is auto 
 generated through avro schema using avro-generate-sources maven plugin.
 
 
 package com.ebay.ep.poc.spark.reporting.process.model.dw;  
 
 @SuppressWarnings("all")
 
 @org.apache.avro.specific.AvroGenerated
 
 public class SpsLevelMetricSum extends 
 org.apache.avro.specific.SpecificRecordBase implements 
 org.apache.avro.specific.SpecificRecord {
 
 ...
 ...
 }
 
 Can anyone suggest how to fix this ?
 
 
 
 -- 
 Deepak
 


Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Akhil Das
Because it's throwing serialization exceptions, and Kryo is a serializer
that can serialize your objects.
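
(A minimal sketch of switching a job to Kryo -- registerKryoClasses assumes
Spark 1.2+, and for Avro SpecificRecords an Avro-aware Kryo serializer such
as chill-avro may additionally be needed:)

import org.apache.spark.SparkConf

// Kryo can serialize classes that do not implement java.io.Serializable,
// which is the problem with the Avro-generated SpsLevelMetricSum here.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // optional, but avoids writing full class names into every serialized record
  .registerKryoClasses(Array(
    classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))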

Thanks
Best Regards

On Fri, Apr 3, 2015 at 5:37 PM, Deepak Jain deepuj...@gmail.com wrote:

 I meant that I did not have to use kyro. Why will kyro help fix this issue
 now ?

 Sent from my iPhone

 On 03-Apr-2015, at 5:36 pm, Deepak Jain deepuj...@gmail.com wrote:

 I was able to write record that extends specificrecord (avro) this class
 was not auto generated. Do we need to do something extra for auto generated
 classes

 Sent from my iPhone

 On 03-Apr-2015, at 5:06 pm, Akhil Das ak...@sigmoidanalytics.com wrote:

 This thread might give you some insights
 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3CCA+WVT8WXbEHac=N0GWxj-s9gqOkgG0VRL5B=ovjwexqm8ev...@mail.gmail.com%3E

 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 3:53 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 My Spark Job failed with


 15/04/03 03:15:36 INFO scheduler.DAGScheduler: Job 0 failed:
 saveAsNewAPIHadoopFile at AbstractInputHelper.scala:103, took 2.480175 s
 15/04/03 03:15:36 ERROR yarn.ApplicationMaster: User class threw
 exception: Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 0)
 had a not serializable result:
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
 - object not serializable (class:
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value:
 {userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt:
 null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
 - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
 - object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0,
 spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt:
 null, currPsLvlId: null}))
 org.apache.spark.SparkException: Job aborted due to stage failure: Task
 0.0 in stage 2.0 (TID 0) had a not serializable result:
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
 - object not serializable (class:
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value:
 {userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt:
 null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
 - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
 - object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0,
 spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt:
 null, currPsLvlId: null}))
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 15/04/03 03:15:36 INFO yarn.ApplicationMaster: Final app status: FAILED,
 exitCode: 15, (reason: User class threw exception: Job aborted due to stage
 failure: Task 0.0 in stage 2.0 (TID 0) had a not serializable result:
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:

 

 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum is
 auto generated through avro schema using avro-generate-sources maven plugin.


 package com.ebay.ep.poc.spark.reporting.process.model.dw;

 @SuppressWarnings("all")

 @org.apache.avro.specific.AvroGenerated

 public class SpsLevelMetricSum extends
 org.apache.avro.specific.SpecificRecordBase implements
 org.apache.avro.specific.SpecificRecord {
 ...
 ...
 }

 Can anyone suggest how to fix this ?



 --
 Deepak





RE: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread pawan kumar
Thanks Mohammed. Will give it a try today. We would also need the Spark SQL
piece as we are migrating our data store from Oracle to C* and it would be
easier to maintain all the reports rather than recreating each one from scratch.

Thanks,
Pawan Venugopal.
On Apr 3, 2015 7:59 AM, Mohammed Guller moham...@glassbeam.com wrote:

  Hi Todd,



 We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly
 with C* using the ODBC driver, but now would like to add Spark SQL to the
 mix. I haven’t been able to find any documentation for how to make this
 combination work.



 We are using the Spark-Cassandra-Connector in our applications, but
 haven’t been able to figure out how to get the Spark SQL Thrift Server to
 use it and connect to C*. That is the missing piece. Once we solve that
 piece of the puzzle then Tableau should be able to see the tables in C*.



 Hi Pawan,

 Tableau + C* is pretty straight forward, especially if you are using DSE.
 Create a new DSN in Tableau using the ODBC driver that comes with DSE. Once
 you connect, Tableau allows to use C* keyspace as schema and column
 families as tables.



 Mohammed



 *From:* pawan kumar [mailto:pkv...@gmail.com]
 *Sent:* Friday, April 3, 2015 7:41 AM
 *To:* Todd Nist
 *Cc:* user@spark.apache.org; Mohammed Guller
 *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra



 Hi Todd,

 Thanks for the link. I would be interested in this solution. I am using
 DSE for cassandra. Would you provide me with info on connecting with DSE
 either through Tableau or zeppelin. The goal here is query cassandra
 through spark sql so that I could perform joins and groupby on my queries.
 Are you able to perform spark sql queries with tableau?

 Thanks,
 Pawan Venugopal

 On Apr 3, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote:

 What version of Cassandra are you using?  Are you using DSE or the stock
 Apache Cassandra version?  I have connected it with DSE, but have not
 attempted it with the standard Apache Cassandra version.



 FWIW,
 http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
 provides an ODBC driver for accessing C* from Tableau.  Granted it does not
 provide all the goodness of Spark.  Are you attempting to leverage the
 spark-cassandra-connector for this?







 On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller moham...@glassbeam.com
 wrote:

 Hi –



 Is anybody using Tableau to analyze data in Cassandra through the Spark
 SQL Thrift Server?



 Thanks!



 Mohammed







Re: How to get a top X percent of a distribution represented as RDD

2015-04-03 Thread Debasish Das
Cool !

You should also consider contributing it back to Spark if you are doing
quantile calculations, for example...there is also a topByKey API added in
master by @coderxiang...see if you can use that API to make the code
clean
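
(A rough sketch of what that API could look like in use, if it is available
in your Spark build -- the names here are hypothetical:)

import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
import org.apache.spark.rdd.RDD

// keep the k largest scores for each segment, where k comes from the
// 10% bound computed earlier
def topPerSegment(scores: RDD[(String, Double)], k: Int): RDD[(String, Array[Double])] =
  scores.topByKey(k)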
On Apr 3, 2015 5:20 AM, Aung Htet aung@gmail.com wrote:

 Hi Debasish, Charles,

 I solved the problem by using a BPQ like method, based on your
 suggestions. So thanks very much for that!

 My approach was
 1) Count the population of each segment in the RDD by map/reduce so that I
 get the bound number N equivalent to 10% of each segment. This becomes the
 size of the BPQ.
 2) Associate the bounds N to the corresponding records in the first RDD.
 3) Reduce the RDD from step 2 by merging the values in every two rows,
 basically creating a sorted list (Indexed Seq)
 4) If the size of the sorted list is greater than N (the bound) then,
 create a new sorted list by using a priority queue and dequeuing top N
 values.

 In the end, I get a record for each segment with N max values for each
 segment.

 Regards,
 Aung








 On Fri, Mar 27, 2015 at 4:27 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 In that case you can directly use count-min-sketch from algebirdthey
 work fine with Spark aggregateBy but I have found the java BPQ from Spark
 much faster than say algebird Heap datastructure...

 On Thu, Mar 26, 2015 at 10:04 PM, Charles Hayden 
 charles.hay...@atigeo.com wrote:

  ​You could also consider using a count-min data structure such as in
 https://github.com/laserson/dsq​

 to get approximate quantiles, then use whatever values you want to
 filter the original sequence.
  --
 *From:* Debasish Das debasish.da...@gmail.com
 *Sent:* Thursday, March 26, 2015 9:45 PM
 *To:* Aung Htet
 *Cc:* user
 *Subject:* Re: How to get a top X percent of a distribution represented
 as RDD

  Idea is to use a heap and get topK elements from every
 partition...then use aggregateBy and for combOp do a merge routine from
 mergeSort...basically get 100 items from partition 1, 100 items from
 partition 2, merge them so that you get sorted 200 items and take 100...for
 merge you can use heap as well...Matei had a BPQ inside Spark which we use
 all the time...Passing arrays over wire is better than passing full heap
 objects and merge routine on array should run faster but needs experiment...

 On Thu, Mar 26, 2015 at 9:26 PM, Aung Htet aung@gmail.com wrote:

 Hi Debasish,

 Thanks for your suggestions. In-memory version is quite useful. I do
 not quite understand how you can use aggregateBy to get 10% top K elements.
 Can you please give an example?

 Thanks,
 Aung

 On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das debasish.da...@gmail.com
  wrote:

 You can do it in-memory as wellget 10% topK elements from each
 partition and use merge from any sort algorithm like timsortbasically
 aggregateBy

  Your version uses shuffle but this version is 0 shuffle..assuming
 your data set is cached you will be using in-memory allReduce through
 treeAggregate...

  But this is only good for top 10% or bottom 10%...if you need to do
 it for top 30% then may be the shuffle version will work better...

 On Thu, Mar 26, 2015 at 8:31 PM, Aung Htet aung@gmail.com wrote:

 Hi all,

  I have a distribution represented as an RDD of tuples, in rows of
 (segment, score)
 For each segment, I want to discard tuples with top X percent scores.
 This seems hard to do in Spark RDD.

  A naive algorithm would be -

  1) Sort RDD by segment & score (descending)
 2) Within each segment, number the rows from top to bottom.
 3) For each  segment, calculate the cut off index. i.e. 90 for 10%
 cut off out of a segment with 100 rows.
 4) For the entire RDD, filter rows with row num <= cut off index

 This does not look like a good algorithm. I would really appreciate
 if someone can suggest a better way to implement this in Spark.

  Regards,
 Aung









Re: Which OS for Spark cluster nodes?

2015-04-03 Thread Charles Feduke
As Akhil says Ubuntu is a good choice if you're starting from near scratch.

Cloudera CDH virtual machine images[1] include Hadoop, HDFS, Spark, and
other big data tools so you can get a cluster running with very little
effort. Keep in mind Cloudera is a for-profit corporation so they are also
selling a product.

Personally I prefer the EC2 scripts[2] that ship with the downloadable
Spark distribution. It provisions a cluster for you on AWS and you can
easily terminate the cluster when you don't need it. Ganglia (monitoring),
HDFS (ephemeral and EBS backed), Tachyon (caching), and Spark are all
installed automatically. For learning, using a cluster of 4 medium machines
is fairly inexpensive. (I use the EC2 scripts for both an integration and
production environment.)

1.
http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html
2. https://spark.apache.org/docs/latest/ec2-scripts.html

On Fri, Apr 3, 2015 at 7:38 AM Akhil Das ak...@sigmoidanalytics.com wrote:

 There isn't any specific Linux distro, but I would prefer Ubuntu for a
 beginner as it's very easy to apt-get install stuff on it.

 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 4:58 PM, Horsmann, Tobias 
 tobias.horsm...@uni-due.de wrote:

  Hi,
 Are there any recommendations for operating systems that one should use
 for setting up Spark/Hadoop nodes in general?
 I am not familiar with the differences between the various Linux
 distributions or how well they are (not) suited for cluster set-ups, so I
 wondered if there are some preferred choices?

  Regards,





Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread pawan kumar
Hi Todd,

Thanks for the link. I would be interested in this solution. I am using DSE
for Cassandra. Would you provide me with info on connecting with DSE either
through Tableau or Zeppelin? The goal here is to query Cassandra through Spark
SQL so that I can perform joins and group-bys in my queries. Are you able
to perform Spark SQL queries with Tableau?

Thanks,
Pawan Venugopal
On Apr 3, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote:

 What version of Cassandra are you using?  Are you using DSE or the stock
 Apache Cassandra version?  I have connected it with DSE, but have not
 attempted it with the standard Apache Cassandra version.

 FWIW,
 http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
 provides an ODBC driver for accessing C* from Tableau.  Granted it does not
 provide all the goodness of Spark.  Are you attempting to leverage the
 spark-cassandra-connector for this?



 On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller moham...@glassbeam.com
 wrote:

  Hi –



 Is anybody using Tableau to analyze data in Cassandra through the Spark
 SQL Thrift Server?



 Thanks!



 Mohammed







RE: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Mohammed Guller
Hi Todd,

We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly with C* 
using the ODBC driver, but now would like to add Spark SQL to the mix. I 
haven’t been able to find any documentation for how to make this combination 
work.

We are using the Spark-Cassandra-Connector in our applications, but haven’t 
been able to figure out how to get the Spark SQL Thrift Server to use it and 
connect to C*. That is the missing piece. Once we solve that piece of the 
puzzle then Tableau should be able to see the tables in C*.

Hi Pawan,
Tableau + C* is pretty straightforward, especially if you are using DSE.
Create a new DSN in Tableau using the ODBC driver that comes with DSE. Once you
connect, Tableau allows you to use the C* keyspace as a schema and column families as
tables.

Mohammed

From: pawan kumar [mailto:pkv...@gmail.com]
Sent: Friday, April 3, 2015 7:41 AM
To: Todd Nist
Cc: user@spark.apache.org; Mohammed Guller
Subject: Re: Tableau + Spark SQL Thrift Server + Cassandra


Hi Todd,

Thanks for the link. I would be interested in this solution. I am using DSE for 
cassandra. Would you provide me with info on connecting with DSE either through 
Tableau or zeppelin. The goal here is query cassandra through spark sql so that 
I could perform joins and groupby on my queries. Are you able to perform spark 
sql queries with tableau?

Thanks,
Pawan Venugopal
On Apr 3, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote:
What version of Cassandra are you using?  Are you using DSE or the stock Apache 
Cassandra version?  I have connected it with DSE, but have not attempted it 
with the standard Apache Cassandra version.

FWIW, 
http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
 provides an ODBC driver for accessing C* from Tableau.  Granted it does not 
provide all the goodness of Spark.  Are you attempting to leverage the 
spark-cassandra-connector for this?



On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller moham...@glassbeam.com wrote:
Hi –

Is anybody using Tableau to analyze data in Cassandra through the Spark SQL 
Thrift Server?

Thanks!

Mohammed




Re: Cannot run the example in the Spark 1.3.0 following the document

2015-04-03 Thread Sean Owen
(That one was already fixed last week, and so should be updated when
the site updates for 1.3.1.)

On Fri, Apr 3, 2015 at 4:59 AM, Michael Armbrust mich...@databricks.com wrote:
 Looks like a typo, try:

 df.select(df("name"), df("age") + 1)

 Or

 df.select("name", "age")

 PRs to fix docs are always appreciated :)

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: maven compile error

2015-04-03 Thread Sean Owen
If you're asking about a compile error, you should include the command
you used to compile.

I am able to compile branch 1.2 successfully with mvn -DskipTests
clean package.

This error is actually an error from scalac, not a compile error from
the code. It sort of sounds like it has not been able to download
scala dependencies. Check or maybe recreate your environment.

On Fri, Apr 3, 2015 at 3:19 AM, myelinji myeli...@aliyun.com wrote:
 Hi all:
   Just now I checked out spark-1.2 on GitHub and wanted to build it with Maven;
 however, I encountered an error during compiling:

 [INFO]
 
 [ERROR] Failed to execute goal
 net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on
 project spark-catalyst_2.10: wrap:
 scala.reflect.internal.MissingRequirementError: object scala.runtime in
 compiler mirror not found. - [Help 1]
 org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
 goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile
 (scala-compile-first) on project spark-catalyst_2.10: wrap:
 scala.reflect.internal.MissingRequirementError: object scala.runtime in
 compiler mirror not found.
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
 at
 org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
 at
 org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
 at
 org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
 at
 org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
 at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
 at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
 at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
 at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
 at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
 at
 org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
 at
 org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
 at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
 Caused by: org.apache.maven.plugin.MojoExecutionException: wrap:
 scala.reflect.internal.MissingRequirementError: object scala.runtime in
 compiler mirror not found.
 at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:490)
 at
 org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:101)
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209)
 ... 19 more
 Caused by: scala.reflect.internal.MissingRequirementError: object
 scala.runtime in compiler mirror not found.
 at
 scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
 at
 scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
 at
 scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
 at
 scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
 at
 scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
 at scala.reflect.internal.Mirrors$RootsBase.getPackage(Mirrors.scala:172)
 at
 scala.reflect.internal.Mirrors$RootsBase.getRequiredPackage(Mirrors.scala:175)
 at
 scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage$lzycompute(Definitions.scala:183)
 at
 scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage(Definitions.scala:183)
 at
 scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass$lzycompute(Definitions.scala:184)
 at
 scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass(Definitions.scala:184)
 at
 scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr$lzycompute(Definitions.scala:1024)
 at
 scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr(Definitions.scala:1023)
 at
 scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses$lzycompute(Definitions.scala:1153)
 at
 scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses(Definitions.scala:1152)
 at
 scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode$lzycompute(Definitions.scala:1196)
 at
 

Spark Job Failed - Class not serializable

2015-04-03 Thread ๏̯͡๏
My Spark Job failed with


15/04/03 03:15:36 INFO scheduler.DAGScheduler: Job 0 failed:
saveAsNewAPIHadoopFile at AbstractInputHelper.scala:103, took 2.480175 s
15/04/03 03:15:36 ERROR yarn.ApplicationMaster: User class threw exception:
Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 0) had a not
serializable result:
com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
Serialization stack:
- object not serializable (class:
com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value:
{userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt:
null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
- field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
- object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0,
spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt:
null, currPsLvlId: null}))
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0
in stage 2.0 (TID 0) had a not serializable result:
com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
Serialization stack:
- object not serializable (class:
com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value:
{userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt:
null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
- field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
- object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0,
spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt:
null, currPsLvlId: null}))
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
15/04/03 03:15:36 INFO yarn.ApplicationMaster: Final app status: FAILED,
exitCode: 15, (reason: User class threw exception: Job aborted due to stage
failure: Task 0.0 in stage 2.0 (TID 0) had a not serializable result:
com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
Serialization stack:



com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum is auto
generated through avro schema using avro-generate-sources maven plugin.


package com.ebay.ep.poc.spark.reporting.process.model.dw;

@SuppressWarnings("all")

@org.apache.avro.specific.AvroGenerated

public class SpsLevelMetricSum extends
org.apache.avro.specific.SpecificRecordBase implements
org.apache.avro.specific.SpecificRecord {
...
...
}

Can anyone suggest how to fix this ?



-- 
Deepak


Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread ๏̯͡๏
Akhil
you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar.
How did you get this lib into the spark/lib folder?
1) Did you place it there?
2) What is the download location?


On Fri, Apr 3, 2015 at 3:42 PM, Todd Nist tsind...@gmail.com wrote:

 Started the spark shell with the one jar from hive suggested:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 
 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar 
 --jars /opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar

 Results in the same error:

 scala> sql("""
      | SELECT path, name, value, v1.peValue, v1.peName
      |  FROM metric_table
      |    lateral view json_tuple(pathElements, 'name', 'value') v1
      |  as peName, peValue
      | """)
 15/04/03 06:01:30 INFO ParseDriver: Parsing command: SELECT path, name,
 value, v1.peValue, v1.peName FROM metric_table   lateral view
 json_tuple(pathElements, 'name', 'value') v1 as peName, peValue
 15/04/03 06:01:31 INFO ParseDriver: Parse Completed
 res2: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[5] at RDD at SchemaRDD.scala:108
 == Query Plan ==
 == Physical Plan ==
 java.lang.ClassNotFoundException: json_tuple

 I will try the rebuild.  Thanks again for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 5:34 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Can you try building Spark
  https://spark.apache.org/docs/1.2.0/building-spark.html#building-with-hive-and-jdbc-support
 with hive support? Before that try to run the following:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores
 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-
 bin.jar --jars /opt/hive/0.13.1/lib/hive-exec.jar

 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 This is for version 1.2.1.  Well the other thread that you reference was
 me attempting it in 1.3.0 to see if the issue was related to 1.2.1.  I did
 not build Spark but used the version from the Spark download site for 1.2.1
 Pre Built for Hadoop 2.4 or Later.

 Since I get the error in both 1.2.1 and 1.3.0,

 15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in thread
 main java.lang.ClassNotFoundException: json_tuple at
 java.net.URLClassLoader$1.run(

 It looks like I just don't have the jar.  Even including all jars in the
 $HIVE/lib directory did not seem to work.  Though when looking in $HIVE/lib
 for 0.13.1, I do not see any json serde or jackson files.  I do see that
 hive-exec.jar contains
 the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
 you know if there is another Jar that is required or should it work just by
 including all jars from $HIVE/lib?

 I can build it locally, but did not think that was required based on the
 version I downloaded; is that not the case?

 Thanks for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 How did you build spark? which version of spark are you having? Doesn't
 this thread already explains it?
 https://www.mail-archive.com/user@spark.apache.org/msg25505.html

 Thanks
 Best Regards

 On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

  Tried your suggestion to no avail.  I actually do not see any
 jackson or json serde jars in the $HIVE/lib directory.  This is hive
 0.13.1 and spark 1.2.1

 Here is what I did:

  I have added the lib folder to the --jars option when starting the
 spark-shell,
 but the job fails. The hive-site.xml is in the $SPARK_HOME/conf
 directory.

 I start the spark-shell as follows:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 
 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

 and like this

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 
 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
 /opt/hive/0.13.1/lib/*

 I’m just doing this in the spark-shell now:

 import org.apache.spark.sql.hive._
 val sqlContext = new HiveContext(sc)
 import sqlContext._
 case class MetricTable(path: String, pathElements: String, name: String, value: String)
 val mt = new MetricTable(path = "/DC1/HOST1/",
   pathElements = """[{"node": "DataCenter","value": "DC1"},{"node": "host","value": "HOST1"}]""",
   name = "Memory Usage (%)",
   value = "29.590943279257175")
 val rdd1 = sc.makeRDD(List(mt))
 rdd1.printSchema()
 rdd1.registerTempTable("metric_table")
 sql("""
 SELECT path, name, value, v1.peValue, v1.peName
  FROM metric_table
    lateral view json_tuple(pathElements, 'name', 'value') v1
  as peName, peValue
 """).collect.foreach(println(_))

 It results in the same error:

 15/04/02 12:33:59 INFO ParseDriver: Parsing command: SELECT path, name, 
 value, v1.peValue, v1.peName FROM metric_table   

Re: maven compile error

2015-04-03 Thread myelinji
Thank you for your reply. When I use Maven to compile the whole project,
the errors are as follows:
[INFO] Spark Project Parent POM .. SUCCESS [4.136s]
[INFO] Spark Project Networking .. SUCCESS [7.405s]
[INFO] Spark Project Shuffle Streaming Service ... SUCCESS [5.071s]
[INFO] Spark Project Core  SUCCESS [3:08.445s]
[INFO] Spark Project Bagel ... SUCCESS [21.613s]
[INFO] Spark Project GraphX .. SUCCESS [58.915s]
[INFO] Spark Project Streaming ... SUCCESS [1:26.202s]
[INFO] Spark Project Catalyst  FAILURE [1.537s]
[INFO] Spark Project SQL . SKIPPED
[INFO] Spark Project ML Library .. SKIPPED
[INFO] Spark Project Tools ... SKIPPED
[INFO] Spark Project Hive  SKIPPED
[INFO] Spark Project REPL  SKIPPED
[INFO] Spark Project Assembly  SKIPPED
[INFO] Spark Project External Twitter  SKIPPED
[INFO] Spark Project External Flume Sink . SKIPPED
[INFO] Spark Project External Flume .. SKIPPED
[INFO] Spark Project External MQTT ... SKIPPED
[INFO] Spark Project External ZeroMQ . SKIPPED
[INFO] Spark Project External Kafka .. SKIPPED
[INFO] Spark Project Examples  SKIPPED
It seems like there is something wrong with the catalyst project. Why can't I
compile this project?

------------------------------------------------------------------
From: Sean Owen so...@cloudera.com
Sent: Friday, April 3, 2015 17:48
To: myelinji myeli...@aliyun.com
Cc: Spark user list user@spark.apache.org
Subject: Re: maven compile error
If you're asking about a compile error, you should include the command
you used to compile.

I am able to compile branch 1.2 successfully with mvn -DskipTests
clean package.

This error is actually an error from scalac, not a compile error from
the code. It sort of sounds like it has not been able to download
scala dependencies. Check or maybe recreate your environment.

On Fri, Apr 3, 2015 at 3:19 AM, myelinji myeli...@aliyun.com wrote:
 Hi,all:
Just now i checked out spark-1.2 on github , wanna to build it use maven,
 how ever I encountered an error during compiling:

 [INFO]
 
 [ERROR] Failed to execute goal
 net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on
 project spark-catalyst_2.10: wrap:
 scala.reflect.internal.MissingRequirementError: object scala.runtime in
 compiler mirror not found. - [Help 1]
 org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
 goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile
 (scala-compile-first) on project spark-catalyst_2.10: wrap:
 scala.reflect.internal.MissingRequirementError: object scala.runtime in
 compiler mirror not found.
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
 at
 org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
 at
 org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
 at
 org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
 at
 org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
 at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
 at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
 at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
 at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
 at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
 at
 org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
 at
 org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
 at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
 Caused by: org.apache.maven.plugin.MojoExecutionException: wrap:
 scala.reflect.internal.MissingRequirementError: object scala.runtime in
 compiler mirror not found.
 at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:490)
 at
 

Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
Hi Mohammed,

Not sure if you have tried this or not.  You could try using the below api
to start the thriftserver with an existing context.

https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42

The one thing that Michael Armbrust @ Databricks recommended was this:

 You can start a JDBC server with an existing context.  See my answer here:
 http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html

So something like this based on example from Cheng Lian:

*Server*

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.catalyst.types._

val sparkContext = sc
import sparkContext._
val sqlContext = new HiveContext(sparkContext)
import sqlContext._
makeRDD((1, "hello") :: (2, "world") :: Nil).toSchemaRDD.cache().registerTempTable("t")
// replace the above with the C* + spark-cassandra-connector code to
// generate the SchemaRDD and registerTempTable

import org.apache.spark.sql.hive.thriftserver._
HiveThriftServer2.startWithContext(sqlContext)

Then Startup

./bin/beeline -u jdbc:hive2://localhost:10000/default
0: jdbc:hive2://localhost:10000/default> select * from t;


I have not tried this yet from Tableau.   My understanding is that the
tempTable is only valid as long as the sqlContext is, so if one terminates
the code representing the *Server*, and then restarts the standard thrift
server, sbin/start-thriftserver ..., the table won't be available.

Another possibility is to perhaps use the tuplejump cash project,
https://github.com/tuplejump/cash.

HTH.

-Todd

On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar pkv...@gmail.com wrote:

 Thanks mohammed. Will give it a try today. We would also need the
 sparksSQL piece as we are migrating our data store from oracle to C* and it
 would be easier to maintain all the reports rather recreating each one from
 scratch.

 Thanks,
 Pawan Venugopal.
 On Apr 3, 2015 7:59 AM, Mohammed Guller moham...@glassbeam.com wrote:

  Hi Todd,



 We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly
 with C* using the ODBC driver, but now would like to add Spark SQL to the
 mix. I haven’t been able to find any documentation for how to make this
 combination work.



 We are using the Spark-Cassandra-Connector in our applications, but
 haven’t been able to figure out how to get the Spark SQL Thrift Server to
 use it and connect to C*. That is the missing piece. Once we solve that
 piece of the puzzle then Tableau should be able to see the tables in C*.



 Hi Pawan,

 Tableau + C* is pretty straight forward, especially if you are using DSE.
 Create a new DSN in Tableau using the ODBC driver that comes with DSE. Once
 you connect, Tableau allows to use C* keyspace as schema and column
 families as tables.



 Mohammed



 *From:* pawan kumar [mailto:pkv...@gmail.com]
 *Sent:* Friday, April 3, 2015 7:41 AM
 *To:* Todd Nist
 *Cc:* user@spark.apache.org; Mohammed Guller
 *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra



 Hi Todd,

 Thanks for the link. I would be interested in this solution. I am using
 DSE for cassandra. Would you provide me with info on connecting with DSE
 either through Tableau or zeppelin. The goal here is query cassandra
 through spark sql so that I could perform joins and groupby on my queries.
 Are you able to perform spark sql queries with tableau?

 Thanks,
 Pawan Venugopal

 On Apr 3, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote:

 What version of Cassandra are you using?  Are you using DSE or the stock
 Apache Cassandra version?  I have connected it with DSE, but have not
 attempted it with the standard Apache Cassandra version.



 FWIW,
 http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
 provides an ODBC driver for accessing C* from Tableau.  Granted it does not
 provide all the goodness of Spark.  Are you attempting to leverage the
 spark-cassandra-connector for this?







 On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller moham...@glassbeam.com
 wrote:

 Hi –



 Is anybody using Tableau to analyze data in Cassandra through the Spark
 SQL Thrift Server?



 Thanks!



 Mohammed








Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
@Pawan

Not sure if you have seen this or not, but here is a good example by
Jonathan Lacefield of Datastax on hooking up Spark SQL with DSE; adding
Tableau is as simple as Mohammed stated with DSE.
https://github.com/jlacefie/sparksqltest.

HTH,
Todd

On Fri, Apr 3, 2015 at 2:39 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Mohammed,

 Not sure if you have tried this or not.  You could try using the below api
 to start the thriftserver with an existing context.


 https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42

 The one thing that Michael Armbrust @ Databricks recommended was this:

 You can start a JDBC server with an existing context.  See my answer
 here:
 http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html

 So something like this based on example from Cheng Lian:

 *Server*

 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.catalyst.types._

 val sparkContext = sc
 import sparkContext._
 val sqlContext = new HiveContext(sparkContext)
 import sqlContext._
 makeRDD((1, "hello") :: (2, "world") :: Nil).toSchemaRDD.cache().registerTempTable("t")
 // replace the above with the C* + spark-cassandra-connector code to
 // generate the SchemaRDD and registerTempTable

 import org.apache.spark.sql.hive.thriftserver._
 HiveThriftServer2.startWithContext(sqlContext)

 Then Startup

 ./bin/beeline -u jdbc:hive2://localhost:10000/default
 0: jdbc:hive2://localhost:10000/default> select * from t;


 I have not tried this yet from Tableau.   My understanding is that the
 tempTable is only valid as long as the sqlContext is, so if one terminates
 the code representing the *Server*, and then restarts the standard thrift
 server, sbin/start-thriftserver ..., the table won't be available.
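 For what it's worth, a rough sketch of the C* piece (the part the comment in
 the snippet above says to replace) might look like the following. This is
 purely illustrative: the keyspace, table and column names are made up, and it
 assumes the spark-cassandra-connector jar is on the classpath and
 spark.cassandra.connection.host is set in the SparkConf.

 import com.datastax.spark.connector._
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.hive.thriftserver._

 case class User(id: Int, name: String)  // mirrors a hypothetical table ks.users

 val sqlContext = new HiveContext(sc)
 import sqlContext._  // brings the implicit RDD-to-SchemaRDD conversion into scope

 // Read ks.users from C*, map each CassandraRow onto the case class, and
 // register the result as a temp table visible to the Thrift server.
 val users = sc.cassandraTable("ks", "users")
   .map(row => User(row.getInt("id"), row.getString("name")))
 users.toSchemaRDD.cache().registerTempTable("users")

 HiveThriftServer2.startWithContext(sqlContext)

 After that, a select * from users should work from beeline (and, in principle,
 from anything speaking the HiveServer2 JDBC/ODBC protocol), subject to the
 same sqlContext-lifetime caveat described above.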

 Another possibility is to perhaps use the tuplejump cash project,
 https://github.com/tuplejump/cash.

 HTH.

 -Todd

 On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar pkv...@gmail.com wrote:

 Thanks Mohammed. Will give it a try today. We would also need the
 Spark SQL piece as we are migrating our data store from Oracle to C*, and it
 would be easier to maintain all the reports rather than recreating each one from
 scratch.

 Thanks,
 Pawan Venugopal.
 On Apr 3, 2015 7:59 AM, Mohammed Guller moham...@glassbeam.com wrote:

  Hi Todd,



 We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly
 with C* using the ODBC driver, but now would like to add Spark SQL to the
 mix. I haven’t been able to find any documentation for how to make this
 combination work.



 We are using the Spark-Cassandra-Connector in our applications, but
 haven’t been able to figure out how to get the Spark SQL Thrift Server to
 use it and connect to C*. That is the missing piece. Once we solve that
 piece of the puzzle then Tableau should be able to see the tables in C*.



 Hi Pawan,

 Tableau + C* is pretty straightforward, especially if you are using
 DSE. Create a new DSN in Tableau using the ODBC driver that comes with DSE.
 Once you connect, Tableau allows you to use a C* keyspace as a schema and column
 families as tables.



 Mohammed



 *From:* pawan kumar [mailto:pkv...@gmail.com]
 *Sent:* Friday, April 3, 2015 7:41 AM
 *To:* Todd Nist
 *Cc:* user@spark.apache.org; Mohammed Guller
 *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra



 Hi Todd,

 Thanks for the link. I would be interested in this solution. I am using
 DSE for Cassandra. Would you provide me with info on connecting with DSE
 either through Tableau or Zeppelin? The goal here is to query Cassandra
 through Spark SQL so that I could perform joins and groupBy on my queries.
 Are you able to perform Spark SQL queries with Tableau?

 Thanks,
 Pawan Venugopal

 On Apr 3, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote:

 What version of Cassandra are you using?  Are you using DSE or the stock
 Apache Cassandra version?  I have connected it with DSE, but have not
 attempted it with the standard Apache Cassandra version.



 FWIW,
 http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
 provides an ODBC driver for accessing C* from Tableau.  Granted it does not
 provide all the goodness of Spark.  Are you attempting to leverage the
 spark-cassandra-connector for this?







 On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller moham...@glassbeam.com
 wrote:

 Hi –



 Is anybody using Tableau to analyze data in Cassandra through the Spark
 SQL Thrift Server?



 Thanks!



 Mohammed









Re: Matei Zaharai: Reddit Ask Me Anything

2015-04-03 Thread ben lorica
Happening right now

   
https://www.reddit.com/r/IAmA/comments/31bkue/im_matei_zaharia_creator_of_spark_and_cto_at/



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Matei-Zaharai-Reddit-Ask-Me-Anything-tp22364p22369.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: About Waiting batches on the spark streaming UI

2015-04-03 Thread Tathagata Das
Very good question! This is because the current code is written such that
the UI considers a batch as waiting only when it has actually started being
processed. That is, batches waiting in the job queue are not considered in the
calculation. It is arguable that it may be more intuitive to count those in
the waiting as well.
On Apr 3, 2015 12:59 AM, bit1...@163.com bit1...@163.com wrote:


 I copied the following from the spark streaming UI, I don't know why the
 Waiting batches is 1, my understanding is that it should be 72.
 Following  is my understanding:
 1. Total time is 1minute 35 seconds=95 seconds
 2. Batch interval is 1 second, so, 95 batches are generated in 95 seconds.
 3. Processed batches are 23(Correct, because in my processing code, it
 does nothing but sleep 4 seconds)
 4. Then the waiting batches should be 95-23=72



- *Started at: * Fri Apr 03 15:17:47 CST 2015
- *Time since start: *1 minute 35 seconds
- *Network receivers: *1
- *Batch interval: *1 second
- *Processed batches: *23
- *Waiting batches: *1
- *Received records: *0
- *Processed records: *0


 --
 bit1...@163.com



MLlib: save models to HDFS?

2015-04-03 Thread S. Zhou
I am new to MLlib so I have a basic question: is it possible to save MLlib 
models (particularly CF models) to HDFS and then reload them later? If yes, could 
you share some sample code (I could not find it in the MLlib tutorial). Thanks!
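
A rough sketch, assuming Spark 1.3+ where MatrixFactorizationModel (the CF
model produced by ALS) and several other MLlib models implement save/load; the
HDFS paths, input format and ALS parameters below are made up:

import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

// Train a CF model from hypothetical "user,product,rating" lines.
val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
  val Array(user, product, rating) = line.split(',')
  Rating(user.toInt, product.toInt, rating.toDouble)
}
val model = ALS.train(ratings, 10, 10, 0.01)  // rank, iterations, lambda

// Persist the model to HDFS, then load it back later (e.g. in another job).
model.save(sc, "hdfs:///models/cf-model")
val sameModel = MatrixFactorizationModel.load(sc, "hdfs:///models/cf-model")

On older releases that lack save/load, one workaround is to persist the model's
user and product feature RDDs with saveAsObjectFile and rebuild the model by
hand, but that is version-specific and not shown here.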

Re: About Waiting batches on the spark streaming UI

2015-04-03 Thread Ted Yu
Maybe add another stat for batches waiting in the job queue ?

Cheers

On Fri, Apr 3, 2015 at 10:01 AM, Tathagata Das t...@databricks.com wrote:

 Very good question! This is because the current code is written such that
 the UI considers a batch as waiting only when it has actually started being
 processed. That is, batches waiting in the job queue are not considered in the
 calculation. It is arguable that it may be more intuitive to count those in
 the waiting as well.
 On Apr 3, 2015 12:59 AM, bit1...@163.com bit1...@163.com wrote:


 I copied the following from the spark streaming UI, I don't know why the
 Waiting batches is 1, my understanding is that it should be 72.
 Following  is my understanding:
 1. Total time is 1minute 35 seconds=95 seconds
 2. Batch interval is 1 second, so, 95 batches are generated in 95 seconds.
 3. Processed batches are 23(Correct, because in my processing code, it
 does nothing but sleep 4 seconds)
 4. Then the waiting batches should be 95-23=72



- *Started at: * Fri Apr 03 15:17:47 CST 2015
- *Time since start: *1 minute 35 seconds
- *Network receivers: *1
- *Batch interval: *1 second
- *Processed batches: *23
- *Waiting batches: *1
- *Received records: *0
- *Processed records: *0


 --
 bit1...@163.com




Re: About Waiting batches on the spark streaming UI

2015-04-03 Thread Tathagata Das
Maybe that should be marked as waiting as well. We plan to update the UI
soon, so we will keep that in mind.
On Apr 3, 2015 10:12 AM, Ted Yu yuzhih...@gmail.com wrote:

 Maybe add another stat for batches waiting in the job queue ?

 Cheers

 On Fri, Apr 3, 2015 at 10:01 AM, Tathagata Das t...@databricks.com
 wrote:

 Very good question! This is because the current code is written such that
 the UI considers a batch as waiting only when it has actually started being
 processed. That is, batches waiting in the job queue are not considered in the
 calculation. It is arguable that it may be more intuitive to count those in
 the waiting as well.
 On Apr 3, 2015 12:59 AM, bit1...@163.com bit1...@163.com wrote:


 I copied the following from the spark streaming UI, I don't know why the
 Waiting batches is 1, my understanding is that it should be 72.
 Following  is my understanding:
 1. Total time is 1minute 35 seconds=95 seconds
 2. Batch interval is 1 second, so, 95 batches are generated in 95
 seconds.
 3. Processed batches are 23(Correct, because in my processing code, it
 does nothing but sleep 4 seconds)
 4. Then the waiting batches should be 95-23=72



- *Started at: * Fri Apr 03 15:17:47 CST 2015
- *Time since start: *1 minute 35 seconds
- *Network receivers: *1
- *Batch interval: *1 second
- *Processed batches: *23
- *Waiting batches: *1
- *Received records: *0
- *Processed records: *0


 --
 bit1...@163.com





Re: Spark + Kinesis

2015-04-03 Thread Kelly, Jonathan
spark-streaming-kinesis-asl is not part of the Spark distribution on your 
cluster, so you cannot have it be just a provided dependency.  This is also 
why the KCL and its dependencies were not included in the assembly (but yes, 
they should be).

~ Jonathan Kelly

From: Vadim Bichutskiy 
vadim.bichuts...@gmail.commailto:vadim.bichuts...@gmail.com
Date: Friday, April 3, 2015 at 12:26 PM
To: Jonathan Kelly jonat...@amazon.commailto:jonat...@amazon.com
Cc: user@spark.apache.orgmailto:user@spark.apache.org 
user@spark.apache.orgmailto:user@spark.apache.org
Subject: Re: Spark + Kinesis

Hi all,

Good news! I was able to create a Kinesis consumer and assemble it into an 
uber jar following 
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html and 
example 
https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala.

However when I try to spark-submit it I get the following exception:

Exception in thread main java.lang.NoClassDefFoundError: 
com/amazonaws/auth/AWSCredentialsProvider

Do I need to include KCL dependency in build.sbt, here's what it looks like 
currently:

import AssemblyKeys._
name := "Kinesis Consumer"
version := "1.0"
organization := "com.myconsumer"
scalaVersion := "2.11.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0" % "provided"

assemblySettings
jarName in assembly := "consumer-assembly.jar"
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

Any help appreciated.

Thanks,
Vadim


On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan 
jonat...@amazon.commailto:jonat...@amazon.com wrote:
It looks like you're attempting to mix Scala versions, so that's going to cause 
some problems.  If you really want to use Scala 2.11.5, you must also use Spark 
package versions built for Scala 2.11 rather than 2.10.  Anyway, that's not 
quite the correct way to specify Scala dependencies in build.sbt.  Instead of 
placing the Scala version after the artifactId (like "spark-core_2.10"), what 
you actually want is to use just "spark-core" with two percent signs before it. 
 Using two percent signs will make it use the version of Scala that matches 
your declared scalaVersion.  For example:

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"

libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

I think that may get you a little closer, though I think you're probably going 
to run into the same problems I ran into in this thread: 
https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I never 
really got an answer for that, and I temporarily moved on to other things for 
now.

~ Jonathan Kelly

From: 'Vadim Bichutskiy' 
vadim.bichuts...@gmail.commailto:vadim.bichuts...@gmail.com
Date: Thursday, April 2, 2015 at 9:53 AM
To: user@spark.apache.orgmailto:user@spark.apache.org 
user@spark.apache.orgmailto:user@spark.apache.org
Subject: Spark + Kinesis

Hi all,

I am trying to write an Amazon Kinesis consumer Scala app that processes data 
in the
Kinesis stream. Is this the correct way to specify build.sbt:

---
import AssemblyKeys._
name := "Kinesis Consumer"
version := "1.0"
organization := "com.myconsumer"
scalaVersion := "2.11.5"

libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
  "org.apache.spark" % "spark-streaming_2.10" % "1.3.0",
  "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")

assemblySettings
jarName in assembly := "consumer-assembly.jar"
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)


In project/assembly.sbt I have only the following line:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

I am using sbt 0.13.7. I adapted Example 7.7 in the Learning Spark book.

Thanks,
Vadim




Re: Spark + Kinesis

2015-04-03 Thread Vadim Bichutskiy
Removed "provided" and got the following error:

[error] (*:assembly) deduplicate: different file contents found in the
following:

[error]
/Users/vb/.ivy2/cache/com.esotericsoftware.kryo/kryo/bundles/kryo-2.21.jar:com/esotericsoftware/minlog/Log$Logger.class

[error]
/Users/vb/.ivy2/cache/com.esotericsoftware.minlog/minlog/jars/minlog-1.2.jar:com/esotericsoftware/minlog/Log$Logger.class

On Fri, Apr 3, 2015 at 3:48 PM, Tathagata Das t...@databricks.com wrote:

 Just remove "provided" for spark-streaming-kinesis-asl

 libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

 On Fri, Apr 3, 2015 at 12:45 PM, Vadim Bichutskiy 
 vadim.bichuts...@gmail.com wrote:

 Thanks. So how do I fix it?

 On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

   spark-streaming-kinesis-asl is not part of the Spark distribution on
 your cluster, so you cannot have it be just a provided dependency.  This
 is also why the KCL and its dependencies were not included in the assembly
 (but yes, they should be).


  ~ Jonathan Kelly

   From: Vadim Bichutskiy vadim.bichuts...@gmail.com
 Date: Friday, April 3, 2015 at 12:26 PM
 To: Jonathan Kelly jonat...@amazon.com
 Cc: user@spark.apache.org user@spark.apache.org
 Subject: Re: Spark + Kinesis

   Hi all,

  Good news! I was able to create a Kinesis consumer and assemble it
 into an uber jar following
 http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
 and example
 https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
 .

  However when I try to spark-submit it I get the following exception:

  *Exception in thread main java.lang.NoClassDefFoundError:
 com/amazonaws/auth/AWSCredentialsProvider*

  Do I need to include KCL dependency in *build.sbt*, here's what it
 looks like currently:

  import AssemblyKeys._
 name := Kinesis Consumer
 version := 1.0
 organization := com.myconsumer
 scalaVersion := 2.11.5

  libraryDependencies += org.apache.spark %% spark-core % 1.3.0 %
 provided
 libraryDependencies += org.apache.spark %% spark-streaming % 1.3.0
 % provided
 libraryDependencies += org.apache.spark %%
 spark-streaming-kinesis-asl % 1.3.0 % provided

  assemblySettings
 jarName in assembly :=  consumer-assembly.jar
 assemblyOption in assembly := (assemblyOption in
 assembly).value.copy(includeScala=false)

  Any help appreciated.

  Thanks,
 Vadim

 On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

  It looks like you're attempting to mix Scala versions, so that's
 going to cause some problems.  If you really want to use Scala 2.11.5, you
 must also use Spark package versions built for Scala 2.11 rather than
 2.10.  Anyway, that's not quite the correct way to specify Scala
 dependencies in build.sbt.  Instead of placing the Scala version after the
 artifactId (like spark-core_2.10), what you actually want is to use just
 spark-core with two percent signs before it.  Using two percent signs
 will make it use the version of Scala that matches your declared
 scalaVersion.  For example:

  libraryDependencies += org.apache.spark %% spark-core % 1.3.0 %
 provided

  libraryDependencies += org.apache.spark %% spark-streaming %
 1.3.0 % provided

  libraryDependencies += org.apache.spark %%
 spark-streaming-kinesis-asl % 1.3.0

  I think that may get you a little closer, though I think you're
 probably going to run into the same problems I ran into in this thread:
 https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I
 never really got an answer for that, and I temporarily moved on to other
 things for now.


  ~ Jonathan Kelly

   From: 'Vadim Bichutskiy' vadim.bichuts...@gmail.com
 Date: Thursday, April 2, 2015 at 9:53 AM
 To: user@spark.apache.org user@spark.apache.org
 Subject: Spark + Kinesis

   Hi all,

  I am trying to write an Amazon Kinesis consumer Scala app that
 processes data in the
 Kinesis stream. Is this the correct way to specify *build.sbt*:

  ---
 import AssemblyKeys._
 name := "Kinesis Consumer"
 version := "1.0"
 organization := "com.myconsumer"
 scalaVersion := "2.11.5"

 libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
   "org.apache.spark" % "spark-streaming_2.10" % "1.3.0",
   "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")

 assemblySettings
 jarName in assembly := "consumer-assembly.jar"
 assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
 

  In *project/assembly.sbt* I have only the following line:

  addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

  I am using sbt 0.13.7. I adapted Example 7.7 in the Learning Spark
 book.

  Thanks,
 Vadim







Re: spark mesos deployment : starting workers based on attributes

2015-04-03 Thread Ankur Chauhan

Hi,

Thanks! I'll add the JIRA. I'll also try to work on a patch this weekend.

-- Ankur Chauhan

On 03/04/2015 13:23, Tim Chen wrote:
 Hi Ankur,
 
 There isn't a way to do that yet, but it's simple to add.
 
 Can you create a JIRA in Spark for this?
 
 Thanks!
 
 Tim
 
 On Fri, Apr 3, 2015 at 1:08 PM, Ankur Chauhan
 achau...@brightcove.com mailto:achau...@brightcove.com wrote:
 
 Hi,
 
 I am trying to figure out if there is a way to tell the mesos 
 scheduler in spark to isolate the workers to a set of mesos slaves 
 that have a given attribute such as `tachyon:true`.
 
 Anyone know if that is possible or how I could achieve such a
 behavior?
 
 Thanks! -- Ankur Chauhan
 
 -

 
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 mailto:user-unsubscr...@spark.apache.org For additional commands,
 e-mail: user-h...@spark.apache.org 
 mailto:user-h...@spark.apache.org
 
 

-- Ankur Chauhan

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
@Pawan,

So it's been a couple of months since I have had a chance to do anything
with Zeppelin, but here is a link to a post on what I did to get it working
https://groups.google.com/forum/#!topic/zeppelin-developers/mCNdyOXNikI.
This may or may not work with the newer releases from Zeppelin.

-Todd

On Fri, Apr 3, 2015 at 3:02 PM, pawan kumar pkv...@gmail.com wrote:

 Hi Todd,

 Thanks for the help. So I was able to get DSE working with Tableau as
 per the link provided by Mohammed. Now I am trying to figure out if I could
 write Spark SQL queries from Tableau and get data from DSE. My end goal is
 to get a web-based tool where I could write SQL queries which will pull
 data from Cassandra.

 With Zeppelin I was able to build and run it in EC2, but I am not sure if the
 configurations are right. I am pointing to a Spark master which is a remote
 DSE node, and all Spark and Spark SQL dependencies are on the remote node. I
 am not sure if I need to install Spark and its dependencies on the web UI
 (Zeppelin) node.

 I am not sure whether talking about Zeppelin in this thread is right.

 Thanks once again for all the help.

 Thanks,
 Pawan Venugopal


 On Fri, Apr 3, 2015 at 11:48 AM, Todd Nist tsind...@gmail.com wrote:

 @Pawan

 Not sure if you have seen this or not, but here is a good example by
 Jonathan Lacefield of Datastax's on hooking up sparksql with DSE, adding
 Tableau is as simple as Mohammed stated with DSE.
 https://github.com/jlacefie/sparksqltest.

 HTH,
 Todd

 On Fri, Apr 3, 2015 at 2:39 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Mohammed,

 Not sure if you have tried this or not.  You could try using the below
 api to start the thriftserver with an existing context.


 https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42

 The one thing that Michael Armbrust @ Databricks recommended was this:

 You can start a JDBC server with an existing context.  See my answer
 here:
 http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html

 So something like this based on example from Cheng Lian:

 *Server*

 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.catalyst.types._

 val sparkContext = sc
 import sparkContext._
 val sqlContext = new HiveContext(sparkContext)
 import sqlContext._
 makeRDD((1, "hello") :: (2, "world") :: Nil).toSchemaRDD.cache().registerTempTable("t")
 // replace the above with the C* + spark-cassandra-connector to generate a
 // SchemaRDD and registerTempTable

 import org.apache.spark.sql.hive.thriftserver._
 HiveThriftServer2.startWithContext(sqlContext)

 Then Startup

 ./bin/beeline -u jdbc:hive2://localhost:1/default
 0: jdbc:hive2://localhost:1/default select * from t;


 I have not tried this yet from Tableau.   My understanding is that the
 tempTable is only valid as long as the sqlContext is, so if one terminates
 the code representing the *Server*, and then restarts the standard
 thrift server, sbin/start-thriftserver ..., the table won't be available.

 Another possibility is to perhaps use the tuplejump cash project,
 https://github.com/tuplejump/cash.

 HTH.

 -Todd

 On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar pkv...@gmail.com wrote:

 Thanks Mohammed. Will give it a try today. We would also need the
 Spark SQL piece as we are migrating our data store from Oracle to C*, and it
 would be easier to maintain all the reports rather than recreating each one from
 scratch.

 Thanks,
 Pawan Venugopal.
 On Apr 3, 2015 7:59 AM, Mohammed Guller moham...@glassbeam.com
 wrote:

  Hi Todd,



 We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly
 with C* using the ODBC driver, but now would like to add Spark SQL to the
 mix. I haven’t been able to find any documentation for how to make this
 combination work.



 We are using the Spark-Cassandra-Connector in our applications, but
 haven’t been able to figure out how to get the Spark SQL Thrift Server to
 use it and connect to C*. That is the missing piece. Once we solve that
 piece of the puzzle then Tableau should be able to see the tables in C*.



 Hi Pawan,

 Tableau + C* is pretty straightforward, especially if you are using
 DSE. Create a new DSN in Tableau using the ODBC driver that comes with
 DSE.
 Once you connect, Tableau allows you to use a C* keyspace as a schema and column
 families as tables.



 Mohammed



 *From:* pawan kumar [mailto:pkv...@gmail.com]
 *Sent:* Friday, April 3, 2015 7:41 AM
 *To:* Todd Nist
 *Cc:* user@spark.apache.org; Mohammed Guller
 *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra



 Hi Todd,

 Thanks for the link. I would be interested in this solution. I am
 using DSE for Cassandra. Would you provide me with info on connecting with
 DSE either through Tableau or Zeppelin? The goal here is to query Cassandra
 through Spark SQL so that I could perform joins and groupBy on my queries.
 Are you able to perform Spark SQL 

Re: WordCount example

2015-04-03 Thread Mohit Anchlia
If I use local[2] instead of URL: spark://ip-10-241-251-232:7077 this
seems to work. I don't understand why, though, because when I
give spark://ip-10-241-251-232:7077 the application seems to bootstrap
successfully; it just doesn't create a socket on the port?


On Fri, Mar 27, 2015 at 10:55 AM, Mohit Anchlia mohitanch...@gmail.com
wrote:

 I checked the ports using netstat and don't see any connections
 established on that port. Logs show only this:

 15/03/27 13:50:48 INFO Master: Registering app NetworkWordCount
 15/03/27 13:50:48 INFO Master: Registered app NetworkWordCount with ID
 app-20150327135048-0002

 Spark ui shows:

 Running Applications
 IDNameCoresMemory per NodeSubmitted TimeUserStateDuration
 app-20150327135048-0002
 http://54.69.225.94:8080/app?appId=app-20150327135048-0002
 NetworkWordCount
 http://ip-10-241-251-232.us-west-2.compute.internal:4040/0512.0 MB2015/03/27
 13:50:48ec2-userWAITING33 s
 Code looks like is being executed:

 java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077

 public static void doWork(String masterUrl) {

 SparkConf conf = new SparkConf().setMaster(masterUrl).setAppName("NetworkWordCount");

 JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

 JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", );

 System.out.println("Successfully created connection");

 mapAndReduce(lines);

 jssc.start(); // Start the computation

 jssc.awaitTermination(); // Wait for the computation to terminate

 }

 public static void main(String ...args) {

 doWork(args[0]);

 }
 And output of the java program after submitting the task:

 java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 15/03/27 13:50:46 INFO SecurityManager: Changing view acls to: ec2-user
 15/03/27 13:50:46 INFO SecurityManager: Changing modify acls to: ec2-user
 15/03/27 13:50:46 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(ec2-user);
 users with modify permissions: Set(ec2-user)
 15/03/27 13:50:46 INFO Slf4jLogger: Slf4jLogger started
 15/03/27 13:50:46 INFO Remoting: Starting remoting
 15/03/27 13:50:47 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkdri...@ip-10-241-251-232.us-west-2.compute.internal
 :60184]
 15/03/27 13:50:47 INFO Utils: Successfully started service 'sparkDriver'
 on port 60184.
 15/03/27 13:50:47 INFO SparkEnv: Registering MapOutputTracker
 15/03/27 13:50:47 INFO SparkEnv: Registering BlockManagerMaster
 15/03/27 13:50:47 INFO DiskBlockManager: Created local directory at
 /tmp/spark-local-20150327135047-5399
 15/03/27 13:50:47 INFO MemoryStore: MemoryStore started with capacity 3.5
 GB
 15/03/27 13:50:47 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 15/03/27 13:50:47 INFO HttpFileServer: HTTP File server directory is
 /tmp/spark-7e26df49-1520-4c77-b411-c837da59fa5b
 15/03/27 13:50:47 INFO HttpServer: Starting HTTP Server
 15/03/27 13:50:47 INFO Utils: Successfully started service 'HTTP file
 server' on port 57955.
 15/03/27 13:50:47 INFO Utils: Successfully started service 'SparkUI' on
 port 4040.
 15/03/27 13:50:47 INFO SparkUI: Started SparkUI at
 http://ip-10-241-251-232.us-west-2.compute.internal:4040
 15/03/27 13:50:47 INFO AppClient$ClientActor: Connecting to master
 spark://ip-10-241-251-232:7077...
 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: Connected to Spark
 cluster with app ID app-20150327135048-0002
 15/03/27 13:50:48 INFO NettyBlockTransferService: Server created on 58358
 15/03/27 13:50:48 INFO BlockManagerMaster: Trying to register BlockManager
 15/03/27 13:50:48 INFO BlockManagerMasterActor: Registering block manager
 ip-10-241-251-232.us-west-2.compute.internal:58358 with 3.5 GB RAM,
 BlockManagerId(driver, ip-10-241-251-232.us-west-2.compute.internal,
 58358)
 15/03/27 13:50:48 INFO BlockManagerMaster: Registered BlockManager
 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: SchedulerBackend is
 ready for scheduling beginning after reached minRegisteredResourcesRatio:
 0.0
 15/03/27 13:50:48 INFO ReceiverTracker: ReceiverTracker started
 15/03/27 13:50:48 INFO ForEachDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO ShuffledDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO MappedDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO FlatMappedDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO SocketInputDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO SocketInputDStream: Slide time = 1000 ms
 15/03/27 13:50:48 INFO SocketInputDStream: Storage level =
 StorageLevel(false, false, false, false, 1)
 15/03/27 13:50:48 INFO SocketInputDStream: Checkpoint interval = null
 15/03/27 13:50:48 INFO SocketInputDStream: Remember duration = 1000 ms
 15/03/27 13:50:48 

Re: Spark + Kinesis

2015-04-03 Thread Vadim Bichutskiy
Thanks. So how do I fix it?

On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan jonat...@amazon.com wrote:

   spark-streaming-kinesis-asl is not part of the Spark distribution on
 your cluster, so you cannot have it be just a provided dependency.  This
 is also why the KCL and its dependencies were not included in the assembly
 (but yes, they should be).


  ~ Jonathan Kelly

   From: Vadim Bichutskiy vadim.bichuts...@gmail.com
 Date: Friday, April 3, 2015 at 12:26 PM
 To: Jonathan Kelly jonat...@amazon.com
 Cc: user@spark.apache.org user@spark.apache.org
 Subject: Re: Spark + Kinesis

   Hi all,

  Good news! I was able to create a Kinesis consumer and assemble it into
 an uber jar following
 http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
 and example
 https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
 .

  However when I try to spark-submit it I get the following exception:

  *Exception in thread main java.lang.NoClassDefFoundError:
 com/amazonaws/auth/AWSCredentialsProvider*

  Do I need to include KCL dependency in *build.sbt*, here's what it looks
 like currently:

  import AssemblyKeys._
 name := Kinesis Consumer
 version := 1.0
 organization := com.myconsumer
 scalaVersion := 2.11.5

  libraryDependencies += org.apache.spark %% spark-core % 1.3.0 %
 provided
 libraryDependencies += org.apache.spark %% spark-streaming % 1.3.0 %
 provided
 libraryDependencies += org.apache.spark %% spark-streaming-kinesis-asl
 % 1.3.0 % provided

  assemblySettings
 jarName in assembly :=  consumer-assembly.jar
 assemblyOption in assembly := (assemblyOption in
 assembly).value.copy(includeScala=false)

  Any help appreciated.

  Thanks,
 Vadim

 On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

  It looks like you're attempting to mix Scala versions, so that's going
 to cause some problems.  If you really want to use Scala 2.11.5, you must
 also use Spark package versions built for Scala 2.11 rather than 2.10.
 Anyway, that's not quite the correct way to specify Scala dependencies in
 build.sbt.  Instead of placing the Scala version after the artifactId (like
 spark-core_2.10), what you actually want is to use just spark-core with
 two percent signs before it.  Using two percent signs will make it use the
 version of Scala that matches your declared scalaVersion.  For example:

  libraryDependencies += org.apache.spark %% spark-core % 1.3.0 %
 provided

  libraryDependencies += org.apache.spark %% spark-streaming %
 1.3.0 % provided

  libraryDependencies += org.apache.spark %%
 spark-streaming-kinesis-asl % 1.3.0

  I think that may get you a little closer, though I think you're
 probably going to run into the same problems I ran into in this thread:
 https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I
 never really got an answer for that, and I temporarily moved on to other
 things for now.


  ~ Jonathan Kelly

   From: 'Vadim Bichutskiy' vadim.bichuts...@gmail.com
 Date: Thursday, April 2, 2015 at 9:53 AM
 To: user@spark.apache.org user@spark.apache.org
 Subject: Spark + Kinesis

   Hi all,

  I am trying to write an Amazon Kinesis consumer Scala app that
 processes data in the
 Kinesis stream. Is this the correct way to specify *build.sbt*:

  ---
 import AssemblyKeys._
 name := "Kinesis Consumer"
 version := "1.0"
 organization := "com.myconsumer"
 scalaVersion := "2.11.5"

 libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
   "org.apache.spark" % "spark-streaming_2.10" % "1.3.0",
   "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")

 assemblySettings
 jarName in assembly := "consumer-assembly.jar"
 assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
 

  In *project/assembly.sbt* I have only the following line:

  addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

  I am using sbt 0.13.7. I adapted Example 7.7 in the Learning Spark book.

  Thanks,
 Vadim





Re: WordCount example

2015-04-03 Thread Tathagata Das
What does the Spark Standalone UI at port 8080 say about number of cores?

On Fri, Apr 3, 2015 at 2:53 PM, Mohit Anchlia mohitanch...@gmail.com
wrote:

 [ec2-user@ip-10-241-251-232 s_lib]$ cat /proc/cpuinfo |grep process
 processor   : 0
 processor   : 1
 processor   : 2
 processor   : 3
 processor   : 4
 processor   : 5
 processor   : 6
 processor   : 7

 On Fri, Apr 3, 2015 at 2:33 PM, Tathagata Das t...@databricks.com wrote:

 How many cores are present in the workers allocated to the standalone
 cluster spark://ip-10-241-251-232:7077 ?


 On Fri, Apr 3, 2015 at 2:18 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

 If I use local[2] instead of *URL:* spark://ip-10-241-251-232:7077 this
 seems to work. I don't understand why though because when I
 give spark://ip-10-241-251-232:7077 application seem to bootstrap
 successfully, just doesn't create a socket on port ?


 On Fri, Mar 27, 2015 at 10:55 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

 I checked the ports using netstat and don't see any connections
 established on that port. Logs show only this:

 15/03/27 13:50:48 INFO Master: Registering app NetworkWordCount
 15/03/27 13:50:48 INFO Master: Registered app NetworkWordCount with ID
 app-20150327135048-0002

 Spark ui shows:

 Running Applications
 IDNameCoresMemory per NodeSubmitted TimeUserStateDuration
 app-20150327135048-0002
 http://54.69.225.94:8080/app?appId=app-20150327135048-0002
 NetworkWordCount
 http://ip-10-241-251-232.us-west-2.compute.internal:4040/0512.0 
 MB2015/03/27
 13:50:48ec2-userWAITING33 s
 Code looks like is being executed:

 java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077

 *public* *static* *void* doWork(String masterUrl){

 SparkConf conf = *new* SparkConf().setMaster(masterUrl).setAppName(
 NetworkWordCount);

 JavaStreamingContext *jssc* = *new* JavaStreamingContext(conf,
 Durations.*seconds*(1));

 JavaReceiverInputDStreamString lines = jssc.socketTextStream(
 localhost, );

 System.*out*.println(Successfully created connection);

 *mapAndReduce*(lines);

  jssc.start(); // Start the computation

 jssc.awaitTermination(); // Wait for the computation to terminate

 }

 *public* *static* *void* main(String ...args){

 *doWork*(args[0]);

 }
 And output of the java program after submitting the task:

 java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 15/03/27 13:50:46 INFO SecurityManager: Changing view acls to: ec2-user
 15/03/27 13:50:46 INFO SecurityManager: Changing modify acls to:
 ec2-user
 15/03/27 13:50:46 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(ec2-user);
 users with modify permissions: Set(ec2-user)
 15/03/27 13:50:46 INFO Slf4jLogger: Slf4jLogger started
 15/03/27 13:50:46 INFO Remoting: Starting remoting
 15/03/27 13:50:47 INFO Remoting: Remoting started; listening on
 addresses
 :[akka.tcp://sparkdri...@ip-10-241-251-232.us-west-2.compute.internal
 :60184]
 15/03/27 13:50:47 INFO Utils: Successfully started service
 'sparkDriver' on port 60184.
 15/03/27 13:50:47 INFO SparkEnv: Registering MapOutputTracker
 15/03/27 13:50:47 INFO SparkEnv: Registering BlockManagerMaster
 15/03/27 13:50:47 INFO DiskBlockManager: Created local directory at
 /tmp/spark-local-20150327135047-5399
 15/03/27 13:50:47 INFO MemoryStore: MemoryStore started with capacity
 3.5 GB
 15/03/27 13:50:47 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 15/03/27 13:50:47 INFO HttpFileServer: HTTP File server directory is
 /tmp/spark-7e26df49-1520-4c77-b411-c837da59fa5b
 15/03/27 13:50:47 INFO HttpServer: Starting HTTP Server
 15/03/27 13:50:47 INFO Utils: Successfully started service 'HTTP file
 server' on port 57955.
 15/03/27 13:50:47 INFO Utils: Successfully started service 'SparkUI' on
 port 4040.
 15/03/27 13:50:47 INFO SparkUI: Started SparkUI at
 http://ip-10-241-251-232.us-west-2.compute.internal:4040
 15/03/27 13:50:47 INFO AppClient$ClientActor: Connecting to master
 spark://ip-10-241-251-232:7077...
 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: Connected to Spark
 cluster with app ID app-20150327135048-0002
 15/03/27 13:50:48 INFO NettyBlockTransferService: Server created on
 58358
 15/03/27 13:50:48 INFO BlockManagerMaster: Trying to register
 BlockManager
 15/03/27 13:50:48 INFO BlockManagerMasterActor: Registering block
 manager ip-10-241-251-232.us-west-2.compute.internal:58358 with 3.5 GB RAM,
 BlockManagerId(driver, ip-10-241-251-232.us-west-2.compute.internal,
 58358)
 15/03/27 13:50:48 INFO BlockManagerMaster: Registered BlockManager
 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: SchedulerBackend is
 ready for scheduling beginning after reached minRegisteredResourcesRatio:
 0.0
 15/03/27 13:50:48 INFO ReceiverTracker: ReceiverTracker started
 

Re: Simple but faster data streaming

2015-04-03 Thread Tathagata Das
I am afraid not. The whole point of Spark Streaming is to make it easy to
do complicated processing on streaming data while interoperating with core
Spark, MLlib, and SQL without the operational overhead of maintaining 4 different
systems. As a slight cost of achieving that unification, there may be some
overhead compared to specialized systems that are designed to do one specific
thing.

If you have to do something simple that could have been done using Flume,
then the resources needed by the Spark Streaming program shouldn't be too
high. Can you provide more details?

TD

On Thu, Apr 2, 2015 at 11:51 AM, Harut Martirosyan 
harut.martiros...@gmail.com wrote:

 Hi guys.

 Is there a more lightweight way of doing stream processing with Spark? What we
 want is a simpler way, preferably with no scheduling, which just streams
 the data to multiple destinations.

 We extensively use Spark Core, SQL, Streaming, and GraphX, so it's our main
 tool and we don't want to add new things to the stack like Storm or Flume. But
 on the other side, it really takes much more resources for the same streaming than
 our previous setup with Flume, especially if we have multiple destinations
 (which triggers multiple actions/scheduling).


 --
 RGRDZ Harut



Re: Spark + Kinesis

2015-04-03 Thread Kelly, Jonathan
Just remove provided from the end of the line where you specify the 
spark-streaming-kinesis-asl dependency.  That will cause that package and all 
of its transitive dependencies (including the KCL, the AWS Java SDK libraries 
and other transitive dependencies) to be included in your uber jar.  They all 
must be in there because they are not part of the Spark distribution in your 
cluster.

However, as I mentioned before, I think making this change might cause you to 
run into the same problems I spoke of in the thread I linked below 
(https://www.mail-archive.com/user@spark.apache.org/msg23891.html), and 
unfortunately I haven't solved that yet.

~ Jonathan Kelly

From: Vadim Bichutskiy 
vadim.bichuts...@gmail.commailto:vadim.bichuts...@gmail.com
Date: Friday, April 3, 2015 at 12:45 PM
To: Jonathan Kelly jonat...@amazon.commailto:jonat...@amazon.com
Cc: user@spark.apache.orgmailto:user@spark.apache.org 
user@spark.apache.orgmailto:user@spark.apache.org
Subject: Re: Spark + Kinesis

Thanks. So how do I fix it?


On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan 
jonat...@amazon.commailto:jonat...@amazon.com wrote:
spark-streaming-kinesis-asl is not part of the Spark distribution on your 
cluster, so you cannot have it be just a provided dependency.  This is also 
why the KCL and its dependencies were not included in the assembly (but yes, 
they should be).

~ Jonathan Kelly

From: Vadim Bichutskiy 
vadim.bichuts...@gmail.commailto:vadim.bichuts...@gmail.com
Date: Friday, April 3, 2015 at 12:26 PM
To: Jonathan Kelly jonat...@amazon.commailto:jonat...@amazon.com
Cc: user@spark.apache.orgmailto:user@spark.apache.org 
user@spark.apache.orgmailto:user@spark.apache.org
Subject: Re: Spark + Kinesis

Hi all,

Good news! I was able to create a Kinesis consumer and assemble it into an 
uber jar following 
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html and 
example 
https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala.

However when I try to spark-submit it I get the following exception:

Exception in thread main java.lang.NoClassDefFoundError: 
com/amazonaws/auth/AWSCredentialsProvider

Do I need to include KCL dependency in build.sbt, here's what it looks like 
currently:

import AssemblyKeys._
name := Kinesis Consumer
version := 1.0
organization := com.myconsumer
scalaVersion := 2.11.5

libraryDependencies += org.apache.spark %% spark-core % 1.3.0 % provided
libraryDependencies += org.apache.spark %% spark-streaming % 1.3.0 % 
provided
libraryDependencies += org.apache.spark %% spark-streaming-kinesis-asl % 
1.3.0 % provided

assemblySettings
jarName in assembly :=  consumer-assembly.jar
assemblyOption in assembly := (assemblyOption in 
assembly).value.copy(includeScala=false)

Any help appreciated.

Thanks,
Vadim

On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan 
jonat...@amazon.commailto:jonat...@amazon.com wrote:
It looks like you're attempting to mix Scala versions, so that's going to cause 
some problems.  If you really want to use Scala 2.11.5, you must also use Spark 
package versions built for Scala 2.11 rather than 2.10.  Anyway, that's not 
quite the correct way to specify Scala dependencies in build.sbt.  Instead of 
placing the Scala version after the artifactId (like spark-core_2.10), what 
you actually want is to use just spark-core with two percent signs before it. 
 Using two percent signs will make it use the version of Scala that matches 
your declared scalaVersion.  For example:

libraryDependencies += org.apache.spark %% spark-core % 1.3.0 % provided

libraryDependencies += org.apache.spark %% spark-streaming % 1.3.0 % 
provided

libraryDependencies += org.apache.spark %% spark-streaming-kinesis-asl % 
1.3.0

I think that may get you a little closer, though I think you're probably going 
to run into the same problems I ran into in this thread: 
https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I never 
really got an answer for that, and I temporarily moved on to other things for 
now.

~ Jonathan Kelly

From: 'Vadim Bichutskiy' 
vadim.bichuts...@gmail.commailto:vadim.bichuts...@gmail.com
Date: Thursday, April 2, 2015 at 9:53 AM
To: user@spark.apache.orgmailto:user@spark.apache.org 
user@spark.apache.orgmailto:user@spark.apache.org
Subject: Spark + Kinesis

Hi all,

I am trying to write an Amazon Kinesis consumer Scala app that processes data 
in the
Kinesis stream. Is this the correct way to specify build.sbt:

---
import AssemblyKeys._
name := Kinesis Consumer
version := 1.0
organization := com.myconsumer
scalaVersion := 2.11.5

libraryDependencies ++= Seq(org.apache.spark % spark-core_2.10 % 1.3.0 % 
provided,
org.apache.spark % spark-streaming_2.10 % 1.3.0
org.apache.spark % spark-streaming-kinesis-asl_2.10 % 1.3.0)


Re: Spark Streaming FileStream Nested File Support

2015-04-03 Thread Adam Ritter
That doesn't seem like a good solution unfortunately as I would be needing
this to work in a production environment.  Do you know why the limitation
exists for FileInputDStream in the first place?  Unless I'm missing
something important about how some of the internals work, I don't see why
this feature couldn't be added at some point.

On Fri, Apr 3, 2015 at 12:47 PM, Tathagata Das t...@databricks.com wrote:

 A sort-of-hacky workaround is to use a queueStream where you can manually
 create RDDs (using sparkContext.hadoopFile) and insert them into the queue. Note
 that this is for testing only, as queueStream does not work with driver
 fault recovery.

 TD
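
 For reference, a minimal sketch of that workaround (using sc.textFile in place
 of hadoopFile for brevity; the batch interval and S3 paths are made up, and
 again this is for testing only):

 import scala.collection.mutable
 import org.apache.spark.rdd.RDD
 import org.apache.spark.streaming.{Seconds, StreamingContext}

 val ssc = new StreamingContext(sc, Seconds(30))
 val rddQueue = new mutable.Queue[RDD[String]]()
 val lines = ssc.queueStream(rddQueue)  // DStream fed from the queue
 lines.count().print()
 ssc.start()
 // ssc.awaitTermination() in a real job

 // Manually create an RDD per nested folder and push it into the queue;
 // with the default oneAtATime behavior, each pushed RDD becomes one batch.
 rddQueue += sc.textFile("s3n://mybucket/2015/03/02/*log")
 rddQueue += sc.textFile("s3n://mybucket/2015/03/03/*log")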

 On Fri, Apr 3, 2015 at 12:23 PM, adamgerst adamge...@gmail.com wrote:

 So after pulling my hair out for a bit trying to convert one of my
 standard
 spark jobs to streaming I found that FileInputDStream does not support
 nested folders (see the brief mention here

 http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
 the fileStream method returns a FileInputDStream).  So before, for my
 standard job, I was reading from say

 s3n://mybucket/2015/03/02/*log

 And could also modify it to simply get an entire months worth of logs.
 Since the logs are split up based upon their date, when the batch ran for
 the day, I simply passed in a parameter of the date to make sure I was
 reading the correct data

 But since I want to turn this job into a streaming job I need to simply do
 something like

 s3n://mybucket/*log

 This would totally work fine if it were a standard spark application, but
 fails for streaming.  Is there anyway I can get around this limitation?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-FileStream-Nested-File-Support-tp22370.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Spark + Kinesis

2015-04-03 Thread Daniil Osipov
Assembly settings have an option to exclude jars. You need something
similar to:
assemblyExcludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
  val excludes = Set(
    "minlog-1.2.jar"
  )
  cp filter { jar => excludes(jar.data.getName) }
}

in your build file (may need to be refactored into a .scala file)
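
An alternative sketch is to resolve just the conflicting minlog classes rather
than excluding the whole jar, via sbt-assembly's merge strategy. This assumes
the older-style mergeStrategy key (and the plugin imports already in your
build) are still accepted by your sbt-assembly version; newer releases call it
assemblyMergeStrategy:

// Keep the first copy of the duplicated com.esotericsoftware.minlog classes
// and fall back to the default strategy for everything else.
mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
  {
    case PathList("com", "esotericsoftware", "minlog", xs @ _*) => MergeStrategy.first
    case x => old(x)
  }
}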

On Fri, Apr 3, 2015 at 12:57 PM, Vadim Bichutskiy 
vadim.bichuts...@gmail.com wrote:

 Remove provided and got the following error:

 [error] (*:assembly) deduplicate: different file contents found in the
 following:

 [error]
 /Users/vb/.ivy2/cache/com.esotericsoftware.kryo/kryo/bundles/kryo-2.21.jar:com/esotericsoftware/minlog/Log$Logger.class

 [error]
 /Users/vb/.ivy2/cache/com.esotericsoftware.minlog/minlog/jars/minlog-1.2.jar:com/esotericsoftware/minlog/Log$Logger.class

 On Fri, Apr 3, 2015 at 3:48 PM, Tathagata Das t...@databricks.com wrote:

 Just remove provided for spark-streaming-kinesis-asl

 libraryDependencies += org.apache.spark %%
 spark-streaming-kinesis-asl % 1.3.0

 On Fri, Apr 3, 2015 at 12:45 PM, Vadim Bichutskiy 
 vadim.bichuts...@gmail.com wrote:

 Thanks. So how do I fix it?

 On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

   spark-streaming-kinesis-asl is not part of the Spark distribution on
 your cluster, so you cannot have it be just a provided dependency.  This
 is also why the KCL and its dependencies were not included in the assembly
 (but yes, they should be).


  ~ Jonathan Kelly

   From: Vadim Bichutskiy vadim.bichuts...@gmail.com
 Date: Friday, April 3, 2015 at 12:26 PM
 To: Jonathan Kelly jonat...@amazon.com
 Cc: user@spark.apache.org user@spark.apache.org
 Subject: Re: Spark + Kinesis

   Hi all,

  Good news! I was able to create a Kinesis consumer and assemble it
 into an uber jar following
 http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
 and example
 https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
 .

  However when I try to spark-submit it I get the following exception:

  *Exception in thread main java.lang.NoClassDefFoundError:
 com/amazonaws/auth/AWSCredentialsProvider*

  Do I need to include KCL dependency in *build.sbt*, here's what it
 looks like currently:

  import AssemblyKeys._
 name := Kinesis Consumer
 version := 1.0
 organization := com.myconsumer
 scalaVersion := 2.11.5

  libraryDependencies += org.apache.spark %% spark-core % 1.3.0 %
 provided
 libraryDependencies += org.apache.spark %% spark-streaming %
 1.3.0 % provided
 libraryDependencies += org.apache.spark %%
 spark-streaming-kinesis-asl % 1.3.0 % provided

  assemblySettings
 jarName in assembly :=  consumer-assembly.jar
 assemblyOption in assembly := (assemblyOption in
 assembly).value.copy(includeScala=false)

  Any help appreciated.

  Thanks,
 Vadim

 On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

  It looks like you're attempting to mix Scala versions, so that's
 going to cause some problems.  If you really want to use Scala 2.11.5, you
 must also use Spark package versions built for Scala 2.11 rather than
 2.10.  Anyway, that's not quite the correct way to specify Scala
 dependencies in build.sbt.  Instead of placing the Scala version after the
 artifactId (like spark-core_2.10), what you actually want is to use just
 spark-core with two percent signs before it.  Using two percent signs
 will make it use the version of Scala that matches your declared
 scalaVersion.  For example:

  libraryDependencies += org.apache.spark %% spark-core % 1.3.0
 % provided

  libraryDependencies += org.apache.spark %% spark-streaming %
 1.3.0 % provided

  libraryDependencies += org.apache.spark %%
 spark-streaming-kinesis-asl % 1.3.0

  I think that may get you a little closer, though I think you're
 probably going to run into the same problems I ran into in this thread:
 https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I
 never really got an answer for that, and I temporarily moved on to other
 things for now.


  ~ Jonathan Kelly

   From: 'Vadim Bichutskiy' vadim.bichuts...@gmail.com
 Date: Thursday, April 2, 2015 at 9:53 AM
 To: user@spark.apache.org user@spark.apache.org
 Subject: Spark + Kinesis

   Hi all,

  I am trying to write an Amazon Kinesis consumer Scala app that
 processes data in the
 Kinesis stream. Is this the correct way to specify *build.sbt*:

  ---
 import AssemblyKeys._
 name := "Kinesis Consumer"
 version := "1.0"
 organization := "com.myconsumer"
 scalaVersion := "2.11.5"

 libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
   "org.apache.spark" % "spark-streaming_2.10" % "1.3.0",
   "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")

 assemblySettings
 jarName in assembly := "consumer-assembly.jar"
 assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
 

  In 

spark mesos deployment : starting workers based on attributes

2015-04-03 Thread Ankur Chauhan

Hi,

I am trying to figure out if there is a way to tell the mesos
scheduler in spark to isolate the workers to a set of mesos slaves
that have a given attribute such as `tachyon:true`.

Anyone know if that is possible or how I could achieve such a behavior?

Thanks!
-- Ankur Chauhan

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Regarding MLLIB sparse and dense matrix

2015-04-03 Thread Jeetendra Gangele
Hi All,
I am building a logistic regression model for matching person data. Let's say
two person objects are given with their attributes and we need to find the score;
that means on one side you have 10 million records and on the other side we have 1
record, and we need to tell which one matches with the highest score among the 1 million.

I am storing the scores of the similarity algos in a dense matrix and considering
these as features. I will apply many similarity algos on each attribute.

Should I use sparse or dense? What happens with dense when a score is null or
when some of the attributes are missing?

Is there any support for regularized logistic regression? Currently I am
using LogisticRegressionWithSGD.

Regards
jeetendra
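
On the last question: a rough sketch, assuming the MLlib 1.x RDD-based API, of
adding regularization to LogisticRegressionWithSGD through its optimizer (the
feature vectors below are made-up similarity scores):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.SquaredL2Updater
import org.apache.spark.mllib.regression.LabeledPoint

val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.9, 0.8, 0.7)),     // all similarity scores present
  LabeledPoint(0.0, Vectors.sparse(3, Seq((0, 0.1))))  // missing scores simply omitted in a sparse vector
))

val lr = new LogisticRegressionWithSGD()
lr.optimizer
  .setNumIterations(100)
  .setRegParam(0.1)                  // regularization strength
  .setUpdater(new SquaredL2Updater)  // L2 penalty; use L1Updater for L1
val model = lr.run(training)

LogisticRegressionWithLBFGS exposes similar optimizer settings and is generally
a better starting point than plain SGD.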


Re: variant record by case classes in shell fails?

2015-04-03 Thread Michael Albert
My apologies for following up on my own post, but a friend just pointed out that if I
use kryo with reference counting AND copy-and-paste, this runs.
However, if I try to load the file, this fails as described below.
I thought load was supposed to be equivalent?
Thanks!
-Mike

  From: Michael Albert m_albert...@yahoo.com.INVALID
 To: User user@spark.apache.org 
 Sent: Friday, April 3, 2015 2:45 PM
 Subject: variant record by case classes in shell fails?
   
Greetings!
For me, the code below fails from the shell. However, I can do essentially the
same from compiled code, exporting the jar.
If I use default serialization or kryo with reference tracking, the error
message tells me it can't find the constructor for A. If I use kryo without
reference tracking, I get a stack overflow.
I'm using Spark 1.2.1 on AWS EMR (hadoop 2.4).
I've also tried putting this code inside an object.
Is this just me?  Am I overlooking something obvious?
Thanks!
-Mike
:paste
sealed class AorB
case class A(i: Int) extends AorB
case class B(i: Int, j: Int) extends AorB

sc.parallelize(0.until(1)).map{ _ =>
    val x = A(1)
    x
}.collect()



  

Re: Spark Streaming FileStream Nested File Support

2015-04-03 Thread Tathagata Das
Yes, definitely can be added. Just haven't gotten around to doing it :)
There are proposals for this that you can try -
https://github.com/apache/spark/pull/2765/files . Have you reviewed it at
some point?

On Fri, Apr 3, 2015 at 1:08 PM, Adam Ritter adamge...@gmail.com wrote:

 That doesn't seem like a good solution unfortunately as I would be needing
 this to work in a production environment.  Do you know why the limitation
 exists for FileInputDStream in the first place?  Unless I'm missing
  something important about how some of the internals work, I don't see why
  this feature couldn't be added at some point.

 On Fri, Apr 3, 2015 at 12:47 PM, Tathagata Das t...@databricks.com
 wrote:

  A sort-of-hacky workaround is to use a queueStream where you can manually
  create RDDs (using sparkContext.hadoopFile) and insert them into the queue. Note
  that this is for testing only, as queueStream does not work with driver
  fault recovery.

 TD

 On Fri, Apr 3, 2015 at 12:23 PM, adamgerst adamge...@gmail.com wrote:

 So after pulling my hair out for a bit trying to convert one of my
 standard
 spark jobs to streaming I found that FileInputDStream does not support
 nested folders (see the brief mention here

 http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
 the fileStream method returns a FileInputDStream).  So before, for my
 standard job, I was reading from say

 s3n://mybucket/2015/03/02/*log

 And could also modify it to simply get an entire months worth of logs.
 Since the logs are split up based upon their date, when the batch ran for
 the day, I simply passed in a parameter of the date to make sure I was
 reading the correct data

 But since I want to turn this job into a streaming job I need to simply
 do
 something like

 s3n://mybucket/*log

 This would totally work fine if it were a standard spark application, but
 fails for streaming.  Is there anyway I can get around this limitation?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-FileStream-Nested-File-Support-tp22370.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Spark Streaming FileStream Nested File Support

2015-04-03 Thread adamgerst
So after pulling my hair out for a bit trying to convert one of my standard
spark jobs to streaming I found that FileInputDStream does not support
nested folders (see the brief mention here
http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
the fileStream method returns a FileInputDStream).  So before, for my
standard job, I was reading from say

s3n://mybucket/2015/03/02/*log

And could also modify it to simply get an entire month's worth of logs.
Since the logs are split up based upon their date, when the batch ran for
the day, I simply passed in a parameter of the date to make sure I was
reading the correct data

But since I want to turn this job into a streaming job I need to simply do
something like

s3n://mybucket/*log

This would totally work fine if it were a standard spark application, but
fails for streaming.  Is there any way I can get around this limitation?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-FileStream-Nested-File-Support-tp22370.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: WordCount example

2015-04-03 Thread Mohit Anchlia
[ec2-user@ip-10-241-251-232 s_lib]$ cat /proc/cpuinfo |grep process
processor   : 0
processor   : 1
processor   : 2
processor   : 3
processor   : 4
processor   : 5
processor   : 6
processor   : 7

On Fri, Apr 3, 2015 at 2:33 PM, Tathagata Das t...@databricks.com wrote:

 How many cores are present in the workers allocated to the standalone
 cluster spark://ip-10-241-251-232:7077 ?


 On Fri, Apr 3, 2015 at 2:18 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

 If I use local[2] instead of *URL:* spark://ip-10-241-251-232:7077 this
 seems to work. I don't understand why though because when I
 give spark://ip-10-241-251-232:7077 application seem to bootstrap
 successfully, just doesn't create a socket on port ?


 On Fri, Mar 27, 2015 at 10:55 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

 I checked the ports using netstat and don't see any connections
 established on that port. Logs show only this:

 15/03/27 13:50:48 INFO Master: Registering app NetworkWordCount
 15/03/27 13:50:48 INFO Master: Registered app NetworkWordCount with ID
 app-20150327135048-0002

 Spark ui shows:

 Running Applications
 ID: app-20150327135048-0002
 (http://54.69.225.94:8080/app?appId=app-20150327135048-0002)
 Name: NetworkWordCount
 (http://ip-10-241-251-232.us-west-2.compute.internal:4040/)
 Cores: 0 | Memory per Node: 512.0 MB
 Submitted Time: 2015/03/27 13:50:48 | User: ec2-user | State: WAITING | Duration: 33 s
 Code looks like is being executed:

 java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077

 public static void doWork(String masterUrl) {
   SparkConf conf = new SparkConf().setMaster(masterUrl).setAppName("NetworkWordCount");
   JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
   JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", );
   System.out.println("Successfully created connection");
   mapAndReduce(lines);
   jssc.start();            // Start the computation
   jssc.awaitTermination(); // Wait for the computation to terminate
 }

 public static void main(String... args) {
   doWork(args[0]);
 }
 And output of the java program after submitting the task:

 java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 15/03/27 13:50:46 INFO SecurityManager: Changing view acls to: ec2-user
 15/03/27 13:50:46 INFO SecurityManager: Changing modify acls to: ec2-user
 15/03/27 13:50:46 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(ec2-user);
 users with modify permissions: Set(ec2-user)
 15/03/27 13:50:46 INFO Slf4jLogger: Slf4jLogger started
 15/03/27 13:50:46 INFO Remoting: Starting remoting
 15/03/27 13:50:47 INFO Remoting: Remoting started; listening on
 addresses
 :[akka.tcp://sparkdri...@ip-10-241-251-232.us-west-2.compute.internal
 :60184]
 15/03/27 13:50:47 INFO Utils: Successfully started service 'sparkDriver'
 on port 60184.
 15/03/27 13:50:47 INFO SparkEnv: Registering MapOutputTracker
 15/03/27 13:50:47 INFO SparkEnv: Registering BlockManagerMaster
 15/03/27 13:50:47 INFO DiskBlockManager: Created local directory at
 /tmp/spark-local-20150327135047-5399
 15/03/27 13:50:47 INFO MemoryStore: MemoryStore started with capacity
 3.5 GB
 15/03/27 13:50:47 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 15/03/27 13:50:47 INFO HttpFileServer: HTTP File server directory is
 /tmp/spark-7e26df49-1520-4c77-b411-c837da59fa5b
 15/03/27 13:50:47 INFO HttpServer: Starting HTTP Server
 15/03/27 13:50:47 INFO Utils: Successfully started service 'HTTP file
 server' on port 57955.
 15/03/27 13:50:47 INFO Utils: Successfully started service 'SparkUI' on
 port 4040.
 15/03/27 13:50:47 INFO SparkUI: Started SparkUI at
 http://ip-10-241-251-232.us-west-2.compute.internal:4040
 15/03/27 13:50:47 INFO AppClient$ClientActor: Connecting to master
 spark://ip-10-241-251-232:7077...
 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: Connected to Spark
 cluster with app ID app-20150327135048-0002
 15/03/27 13:50:48 INFO NettyBlockTransferService: Server created on 58358
 15/03/27 13:50:48 INFO BlockManagerMaster: Trying to register
 BlockManager
 15/03/27 13:50:48 INFO BlockManagerMasterActor: Registering block
 manager ip-10-241-251-232.us-west-2.compute.internal:58358 with 3.5 GB RAM,
 BlockManagerId(driver, ip-10-241-251-232.us-west-2.compute.internal,
 58358)
 15/03/27 13:50:48 INFO BlockManagerMaster: Registered BlockManager
 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: SchedulerBackend is
 ready for scheduling beginning after reached minRegisteredResourcesRatio:
 0.0
 15/03/27 13:50:48 INFO ReceiverTracker: ReceiverTracker started
 15/03/27 13:50:48 INFO ForEachDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO ShuffledDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO 

Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread pawan kumar
Hi Todd,

Thanks for the help. I was able to get DSE working with Tableau as
per the link provided by Mohammed. Now I am trying to figure out if I can
write Spark SQL queries from Tableau and get data from DSE. My end goal is
to get a web-based tool where I can write SQL queries that pull
data from Cassandra.

With Zeppelin I was able to build and run it in EC2, but I am not sure if the
configuration is right. I am pointing to a Spark master which is a remote
DSE node, and all the Spark and Spark SQL dependencies are on that remote node. I
am not sure if I need to install Spark and its dependencies on the web UI
(Zeppelin) node.

I am not sure whether talking about Zeppelin in this thread is appropriate.

Thanks once again for all the help.

Thanks,
Pawan Venugopal


On Fri, Apr 3, 2015 at 11:48 AM, Todd Nist tsind...@gmail.com wrote:

 @Pawan

 Not sure if you have seen this or not, but here is a good example by
 Jonathan Lacefield of Datastax's on hooking up sparksql with DSE, adding
 Tableau is as simple as Mohammed stated with DSE.
 https://github.com/jlacefie/sparksqltest.

 HTH,
 Todd

 On Fri, Apr 3, 2015 at 2:39 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Mohammed,

 Not sure if you have tried this or not.  You could try using the below
 api to start the thriftserver with an existing context.


 https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42

 The one thing that Michael Armbrust @ Databricks recommended was this:

 You can start a JDBC server with an existing context.  See my answer
 here:
 http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html

 So something like this based on example from Cheng Lian:

 *Server*

 import  org.apache.spark.sql.hive.HiveContext
 import  org.apache.spark.sql.catalyst.types._

 val  sparkContext  =  sc
 import  sparkContext._
 val  sqlContext  =  new  HiveContext(sparkContext)
 import  sqlContext._
 makeRDD((1, "hello") :: (2, "world") :: Nil).toSchemaRDD.cache().registerTempTable("t")
 // replace the above with the C* + spark-cassandra-connector to generate a
 // SchemaRDD and registerTempTable

 import  org.apache.spark.sql.hive.thriftserver._
 HiveThriftServer2.startWithContext(sqlContext)

 Then Startup

 ./bin/beeline -u jdbc:hive2://localhost:1/default
 0: jdbc:hive2://localhost:1/default select * from t;


 I have not tried this yet from Tableau.   My understanding is that the
 tempTable is only valid as long as the sqlContext is, so if one terminates
 the code representing the *Server*, and then restarts the standard
 thrift server, sbin/start-thriftserver ..., the table won't be available.

 Another possibility is to perhaps use the tuplejump cash project,
 https://github.com/tuplejump/cash.

 HTH.

 -Todd
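
 Purely to illustrate the placeholder comment in the snippet above, a sketch assuming
 spark-cassandra-connector 1.2.x, a made-up keyspace ks with a users(id int, name text)
 table, and spark.cassandra.connection.host already set on the conf:

   import com.datastax.spark.connector._
   import org.apache.spark.sql.hive.HiveContext
   import org.apache.spark.sql.hive.thriftserver._

   // Hypothetical C* table: ks.users(id int, name text)
   case class User(id: Int, name: String)

   val sqlContext = new HiveContext(sc)
   import sqlContext.implicits._

   // Pull the Cassandra table through the connector and expose it as a temp table.
   val users = sc.cassandraTable[User]("ks", "users")
   users.toDF().cache().registerTempTable("users")

   // Serve this context over JDBC so beeline/Tableau can see the "users" table.
   HiveThriftServer2.startWithContext(sqlContext)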

 On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar pkv...@gmail.com wrote:

 Thanks mohammed. Will give it a try today. We would also need the
 sparksSQL piece as we are migrating our data store from oracle to C* and it
 would be easier to maintain all the reports rather recreating each one from
 scratch.

 Thanks,
 Pawan Venugopal.
 On Apr 3, 2015 7:59 AM, Mohammed Guller moham...@glassbeam.com
 wrote:

  Hi Todd,



 We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly
 with C* using the ODBC driver, but now would like to add Spark SQL to the
 mix. I haven’t been able to find any documentation for how to make this
 combination work.



 We are using the Spark-Cassandra-Connector in our applications, but
 haven’t been able to figure out how to get the Spark SQL Thrift Server to
 use it and connect to C*. That is the missing piece. Once we solve that
 piece of the puzzle then Tableau should be able to see the tables in C*.



 Hi Pawan,

 Tableau + C* is pretty straight forward, especially if you are using
 DSE. Create a new DSN in Tableau using the ODBC driver that comes with DSE.
 Once you connect, Tableau allows to use C* keyspace as schema and column
 families as tables.



 Mohammed



 *From:* pawan kumar [mailto:pkv...@gmail.com]
 *Sent:* Friday, April 3, 2015 7:41 AM
 *To:* Todd Nist
 *Cc:* user@spark.apache.org; Mohammed Guller
 *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra



 Hi Todd,

 Thanks for the link. I would be interested in this solution. I am using
 DSE for cassandra. Would you provide me with info on connecting with DSE
 either through Tableau or zeppelin. The goal here is query cassandra
 through spark sql so that I could perform joins and groupby on my queries.
 Are you able to perform spark sql queries with tableau?

 Thanks,
 Pawan Venugopal

 On Apr 3, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote:

 What version of Cassandra are you using?  Are you using DSE or the
 stock Apache Cassandra version?  I have connected it with DSE, but have not
 attempted it with the standard Apache Cassandra version.



 FWIW,
 

Re: MLlib: save models to HDFS?

2015-04-03 Thread Xiangrui Meng
In 1.3, you can use model.save(sc, hdfs path). You can check the
code examples here:
http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#examples.
-Xiangrui
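
For example, a minimal sketch with ALS (the toy ratings and HDFS path are made up):

  import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

  // Assumed input: an RDD[Rating]; here just a few hard-coded (user, product, rating) triples.
  val ratings = sc.parallelize(Seq(Rating(1, 10, 4.0), Rating(2, 10, 3.0), Rating(1, 20, 5.0)))

  val model = ALS.train(ratings, 10 /* rank */, 10 /* iterations */, 0.01 /* lambda */)

  // Persist the factor matrices and metadata to HDFS (path is an assumption).
  model.save(sc, "hdfs:///models/als-20150403")

  // Later, possibly in a different application:
  val reloaded = MatrixFactorizationModel.load(sc, "hdfs:///models/als-20150403")
  val score = reloaded.predict(1, 20)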

On Fri, Apr 3, 2015 at 2:17 PM, Justin Yip yipjus...@prediction.io wrote:
 Hello Zhou,

 You can look at the recommendation template of PredictionIO. PredictionIO is
 built on top of Spark, and this template illustrates how you can save
 the ALS model to HDFS and then reload it later.

 Justin


 On Fri, Apr 3, 2015 at 9:16 AM, S. Zhou myx...@yahoo.com.invalid wrote:

 I am new to MLib so I have a basic question: is it possible to save MLlib
 models (particularly CF models) to HDFS and then reload it later? If yes,
 could u share some sample code (I could not find it in MLlib tutorial).
 Thanks!



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: WordCount example

2015-04-03 Thread Tathagata Das
How many cores are present in the workers allocated to the standalone cluster
spark://ip-10-241-251-232:7077 ?


On Fri, Apr 3, 2015 at 2:18 PM, Mohit Anchlia mohitanch...@gmail.com
wrote:

 If I use local[2] instead of *URL:* spark://ip-10-241-251-232:7077 this
 seems to work. I don't understand why though because when I
 give spark://ip-10-241-251-232:7077 application seem to bootstrap
 successfully, just doesn't create a socket on port ?


 On Fri, Mar 27, 2015 at 10:55 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

 I checked the ports using netstat and don't see any connections
 established on that port. Logs show only this:

 15/03/27 13:50:48 INFO Master: Registering app NetworkWordCount
 15/03/27 13:50:48 INFO Master: Registered app NetworkWordCount with ID
 app-20150327135048-0002

 Spark ui shows:

 Running Applications
 ID: app-20150327135048-0002
 (http://54.69.225.94:8080/app?appId=app-20150327135048-0002)
 Name: NetworkWordCount
 (http://ip-10-241-251-232.us-west-2.compute.internal:4040/)
 Cores: 0 | Memory per Node: 512.0 MB
 Submitted Time: 2015/03/27 13:50:48 | User: ec2-user | State: WAITING | Duration: 33 s
 Code looks like is being executed:

 java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077

 public static void doWork(String masterUrl) {
   SparkConf conf = new SparkConf().setMaster(masterUrl).setAppName("NetworkWordCount");
   JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
   JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", );
   System.out.println("Successfully created connection");
   mapAndReduce(lines);
   jssc.start();            // Start the computation
   jssc.awaitTermination(); // Wait for the computation to terminate
 }

 public static void main(String... args) {
   doWork(args[0]);
 }
 And output of the java program after submitting the task:

 java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 15/03/27 13:50:46 INFO SecurityManager: Changing view acls to: ec2-user
 15/03/27 13:50:46 INFO SecurityManager: Changing modify acls to: ec2-user
 15/03/27 13:50:46 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(ec2-user);
 users with modify permissions: Set(ec2-user)
 15/03/27 13:50:46 INFO Slf4jLogger: Slf4jLogger started
 15/03/27 13:50:46 INFO Remoting: Starting remoting
 15/03/27 13:50:47 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkdri...@ip-10-241-251-232.us-west-2.compute.internal
 :60184]
 15/03/27 13:50:47 INFO Utils: Successfully started service 'sparkDriver'
 on port 60184.
 15/03/27 13:50:47 INFO SparkEnv: Registering MapOutputTracker
 15/03/27 13:50:47 INFO SparkEnv: Registering BlockManagerMaster
 15/03/27 13:50:47 INFO DiskBlockManager: Created local directory at
 /tmp/spark-local-20150327135047-5399
 15/03/27 13:50:47 INFO MemoryStore: MemoryStore started with capacity 3.5
 GB
 15/03/27 13:50:47 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 15/03/27 13:50:47 INFO HttpFileServer: HTTP File server directory is
 /tmp/spark-7e26df49-1520-4c77-b411-c837da59fa5b
 15/03/27 13:50:47 INFO HttpServer: Starting HTTP Server
 15/03/27 13:50:47 INFO Utils: Successfully started service 'HTTP file
 server' on port 57955.
 15/03/27 13:50:47 INFO Utils: Successfully started service 'SparkUI' on
 port 4040.
 15/03/27 13:50:47 INFO SparkUI: Started SparkUI at
 http://ip-10-241-251-232.us-west-2.compute.internal:4040
 15/03/27 13:50:47 INFO AppClient$ClientActor: Connecting to master
 spark://ip-10-241-251-232:7077...
 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: Connected to Spark
 cluster with app ID app-20150327135048-0002
 15/03/27 13:50:48 INFO NettyBlockTransferService: Server created on 58358
 15/03/27 13:50:48 INFO BlockManagerMaster: Trying to register BlockManager
 15/03/27 13:50:48 INFO BlockManagerMasterActor: Registering block manager
 ip-10-241-251-232.us-west-2.compute.internal:58358 with 3.5 GB RAM,
 BlockManagerId(driver, ip-10-241-251-232.us-west-2.compute.internal,
 58358)
 15/03/27 13:50:48 INFO BlockManagerMaster: Registered BlockManager
 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: SchedulerBackend is
 ready for scheduling beginning after reached minRegisteredResourcesRatio:
 0.0
 15/03/27 13:50:48 INFO ReceiverTracker: ReceiverTracker started
 15/03/27 13:50:48 INFO ForEachDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO ShuffledDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO MappedDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO FlatMappedDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO SocketInputDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO SocketInputDStream: Slide time = 1000 ms
 15/03/27 13:50:48 INFO SocketInputDStream: Storage level =
 

RE: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Mohammed Guller
Thanks, Todd.

It is an interesting idea; worth trying.

I think the cash project is old. The tuplejump guy has created another project 
called CalliopeServer2, which works like a charm with BI tools that use JDBC, 
but unfortunately Tableau throws an error when it connects to it.

Mohammed

From: Todd Nist [mailto:tsind...@gmail.com]
Sent: Friday, April 3, 2015 11:39 AM
To: pawan kumar
Cc: Mohammed Guller; user@spark.apache.org
Subject: Re: Tableau + Spark SQL Thrift Server + Cassandra

Hi Mohammed,

Not sure if you have tried this or not.  You could try using the below api to 
start the thriftserver with an existing context.

https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42

The one thing that Michael Armbrust @ Databricks recommended was this:
You can start a JDBC server with an existing context.  See my answer here: 
http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html

So something like this based on example from Cheng Lian:

Server

import  org.apache.spark.sql.hive.HiveContext

import  org.apache.spark.sql.catalyst.types._



val  sparkContext  =  sc

import  sparkContext._

val  sqlContext  =  new  HiveContext(sparkContext)

import  sqlContext._

makeRDD((1, "hello") :: (2, "world") :: Nil).toSchemaRDD.cache().registerTempTable("t")

// replace the above with the C* + spark-cassandra-connector to generate a
// SchemaRDD and registerTempTable



import  org.apache.spark.sql.hive.thriftserver._

HiveThriftServer2.startWithContext(sqlContext)
Then Startup

./bin/beeline -u jdbc:hive2://localhost:1/default

0: jdbc:hive2://localhost:1/default select * from t;


I have not tried this yet from Tableau.   My understanding is that the 
tempTable is only valid as long as the sqlContext is, so if one terminates the 
code representing the Server, and then restarts the standard thrift server, 
sbin/start-thriftserver ..., the table won't be available.

Another possibility is to perhaps use the tuplejump cash project, 
https://github.com/tuplejump/cash.

HTH.

-Todd

On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar 
pkv...@gmail.commailto:pkv...@gmail.com wrote:

Thanks mohammed. Will give it a try today. We would also need the sparksSQL 
piece as we are migrating our data store from oracle to C* and it would be 
easier to maintain all the reports rather recreating each one from scratch.

Thanks,
Pawan Venugopal.
On Apr 3, 2015 7:59 AM, Mohammed Guller 
moham...@glassbeam.commailto:moham...@glassbeam.com wrote:
Hi Todd,

We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly with C* 
using the ODBC driver, but now would like to add Spark SQL to the mix. I 
haven’t been able to find any documentation for how to make this combination 
work.

We are using the Spark-Cassandra-Connector in our applications, but haven’t 
been able to figure out how to get the Spark SQL Thrift Server to use it and 
connect to C*. That is the missing piece. Once we solve that piece of the 
puzzle then Tableau should be able to see the tables in C*.

Hi Pawan,
Tableau + C* is pretty straight forward, especially if you are using DSE. 
Create a new DSN in Tableau using the ODBC driver that comes with DSE. Once you 
connect, Tableau allows to use C* keyspace as schema and column families as 
tables.

Mohammed

From: pawan kumar [mailto:pkv...@gmail.commailto:pkv...@gmail.com]
Sent: Friday, April 3, 2015 7:41 AM
To: Todd Nist
Cc: user@spark.apache.orgmailto:user@spark.apache.org; Mohammed Guller
Subject: Re: Tableau + Spark SQL Thrift Server + Cassandra


Hi Todd,

Thanks for the link. I would be interested in this solution. I am using DSE for 
cassandra. Would you provide me with info on connecting with DSE either through 
Tableau or zeppelin. The goal here is query cassandra through spark sql so that 
I could perform joins and groupby on my queries. Are you able to perform spark 
sql queries with tableau?

Thanks,
Pawan Venugopal
On Apr 3, 2015 5:03 AM, Todd Nist 
tsind...@gmail.commailto:tsind...@gmail.com wrote:
What version of Cassandra are you using?  Are you using DSE or the stock Apache 
Cassandra version?  I have connected it with DSE, but have not attempted it 
with the standard Apache Cassandra version.

FWIW, 
http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
 provides an ODBC driver tor accessing C* from Tableau.  Granted it does not 
provide all the goodness of Spark.  Are you attempting to leverage the 
spark-cassandra-connector for this?



On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller 
moham...@glassbeam.commailto:moham...@glassbeam.com wrote:
Hi –

Is anybody using Tableau to analyze data in Cassandra through the Spark SQL 
Thrift Server?

Thanks!

Mohammed





Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread pawan kumar
@Todd,

I had looked at it yesterday. All the dependencies it describes are already added on
the DSE node. Do I need to include the Spark and DSE dependencies on the
Zeppelin node as well?

I built Zeppelin with no Spark and no Hadoop. My understanding is that Zeppelin
will send a request to a remote master at spark://
ec2-54-163-181-25.compute-1.amazonaws.com:7077, which is a DSE box. This
box has all the dependencies. Do I need to install all the dependencies on the
Zeppelin box?

Thanks,
Pawan Venugopal

On Fri, Apr 3, 2015 at 12:08 PM, Todd Nist tsind...@gmail.com wrote:

 @Pawan,

 So it's been a couple of months since I have had a chance to do anything
 with Zeppelin, but here is a link to a post on what I did to get it working
 https://groups.google.com/forum/#!topic/zeppelin-developers/mCNdyOXNikI.
 This may or may not work with the newer releases from Zeppelin.

 -Todd

 On Fri, Apr 3, 2015 at 3:02 PM, pawan kumar pkv...@gmail.com wrote:

 Hi Todd,

 Thanks for the help. So i was able to get the DSE working with tableau as
 per the link provided by Mohammed. Now i trying to figure out if i could
 write sparksql queries from tableau and get data from DSE. My end goal is
 to get a web based tool where i could write sql queries which will pull
 data from cassandra.

 With Zeppelin I was able to build and run it in EC2 but not sure if
 configurations are right. I am pointing to a spark master which is a remote
 DSE node and all spark and sparksql dependencies are in the remote node. I
 am not sure if i need to install spark and its dependencies in the webui
 (zepplene) node.

 I am not sure talking about zepplelin in this thread is right.

 Thanks once again for all the help.

 Thanks,
 Pawan Venugopal


 On Fri, Apr 3, 2015 at 11:48 AM, Todd Nist tsind...@gmail.com wrote:

 @Pawan

 Not sure if you have seen this or not, but here is a good example by
 Jonathan Lacefield of Datastax's on hooking up sparksql with DSE, adding
 Tableau is as simple as Mohammed stated with DSE.
 https://github.com/jlacefie/sparksqltest.

 HTH,
 Todd

 On Fri, Apr 3, 2015 at 2:39 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Mohammed,

 Not sure if you have tried this or not.  You could try using the below
 api to start the thriftserver with an existing context.


 https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42

 The one thing that Michael Armbrust @ Databricks recommended was this:

 You can start a JDBC server with an existing context.  See my answer
 here:
 http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html

 So something like this based on example from Cheng Lian:

 *Server*

 import  org.apache.spark.sql.hive.HiveContext
 import  org.apache.spark.sql.catalyst.types._

 val  sparkContext  =  sc
 import  sparkContext._
 val  sqlContext  =  new  HiveContext(sparkContext)
 import  sqlContext._
 makeRDD((1, "hello") :: (2, "world") :: Nil).toSchemaRDD.cache().registerTempTable("t")
 // replace the above with the C* + spark-cassandra-connector to generate a
 // SchemaRDD and registerTempTable

 import  org.apache.spark.sql.hive.thriftserver._
 HiveThriftServer2.startWithContext(sqlContext)

 Then Startup

 ./bin/beeline -u jdbc:hive2://localhost:1/default
 0: jdbc:hive2://localhost:1/default select * from t;


 I have not tried this yet from Tableau.   My understanding is that the
 tempTable is only valid as long as the sqlContext is, so if one terminates
 the code representing the *Server*, and then restarts the standard
 thrift server, sbin/start-thriftserver ..., the table won't be available.

 Another possibility is to perhaps use the tuplejump cash project,
 https://github.com/tuplejump/cash.

 HTH.

 -Todd

 On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar pkv...@gmail.com wrote:

 Thanks mohammed. Will give it a try today. We would also need the
 sparksSQL piece as we are migrating our data store from oracle to C* and 
 it
 would be easier to maintain all the reports rather recreating each one 
 from
 scratch.

 Thanks,
 Pawan Venugopal.
 On Apr 3, 2015 7:59 AM, Mohammed Guller moham...@glassbeam.com
 wrote:

  Hi Todd,



 We are using Apache C* 2.1.3, not DSE. We got Tableau to work
 directly with C* using the ODBC driver, but now would like to add Spark 
 SQL
 to the mix. I haven’t been able to find any documentation for how to make
 this combination work.



 We are using the Spark-Cassandra-Connector in our applications, but
 haven’t been able to figure out how to get the Spark SQL Thrift Server to
 use it and connect to C*. That is the missing piece. Once we solve that
 piece of the puzzle then Tableau should be able to see the tables in C*.



 Hi Pawan,

 Tableau + C* is pretty straight forward, especially if you are using
 DSE. Create a new DSN in Tableau using the ODBC driver that comes with 
 DSE.
 Once you connect, Tableau allows to use C* keyspace as schema and column
 families as tables.



 

Re: Spark + Kinesis

2015-04-03 Thread Tathagata Das
Just remove "provided" for spark-streaming-kinesis-asl:

libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"
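
In context, the full dependency block would then look roughly like this (a sketch combining
this with Jonathan's earlier %% advice; the Scala version shown is an assumption and must
match the Spark artifacts you use):

  scalaVersion := "2.10.4"  // assumption: 2.10 artifacts; use 2.11.x only with 2.11 Spark builds

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"                  % "1.3.0" % "provided",
    "org.apache.spark" %% "spark-streaming"             % "1.3.0" % "provided",
    "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"   // not provided: must land in the assembly
  )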

On Fri, Apr 3, 2015 at 12:45 PM, Vadim Bichutskiy 
vadim.bichuts...@gmail.com wrote:

 Thanks. So how do I fix it?

 On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

   spark-streaming-kinesis-asl is not part of the Spark distribution on
 your cluster, so you cannot have it be just a provided dependency.  This
 is also why the KCL and its dependencies were not included in the assembly
 (but yes, they should be).


  ~ Jonathan Kelly

   From: Vadim Bichutskiy vadim.bichuts...@gmail.com
 Date: Friday, April 3, 2015 at 12:26 PM
 To: Jonathan Kelly jonat...@amazon.com
 Cc: user@spark.apache.org user@spark.apache.org
 Subject: Re: Spark + Kinesis

   Hi all,

  Good news! I was able to create a Kinesis consumer and assemble it into
 an uber jar following
 http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
 and example
 https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
 .

  However when I try to spark-submit it I get the following exception:

  *Exception in thread main java.lang.NoClassDefFoundError:
 com/amazonaws/auth/AWSCredentialsProvider*

  Do I need to include KCL dependency in *build.sbt*, here's what it
 looks like currently:

  import AssemblyKeys._

  name := "Kinesis Consumer"
  version := "1.0"
  organization := "com.myconsumer"
  scalaVersion := "2.11.5"

  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
  libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"
  libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0" % "provided"

  assemblySettings
  jarName in assembly := "consumer-assembly.jar"
  assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

  Any help appreciated.

  Thanks,
 Vadim

 On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

  It looks like you're attempting to mix Scala versions, so that's going
 to cause some problems.  If you really want to use Scala 2.11.5, you must
 also use Spark package versions built for Scala 2.11 rather than 2.10.
 Anyway, that's not quite the correct way to specify Scala dependencies in
 build.sbt.  Instead of placing the Scala version after the artifactId (like
 spark-core_2.10), what you actually want is to use just spark-core with
 two percent signs before it.  Using two percent signs will make it use the
 version of Scala that matches your declared scalaVersion.  For example:

  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

  libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"

  libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

  I think that may get you a little closer, though I think you're
 probably going to run into the same problems I ran into in this thread:
 https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I
 never really got an answer for that, and I temporarily moved on to other
 things for now.


  ~ Jonathan Kelly

   From: 'Vadim Bichutskiy' vadim.bichuts...@gmail.com
 Date: Thursday, April 2, 2015 at 9:53 AM
 To: user@spark.apache.org user@spark.apache.org
 Subject: Spark + Kinesis

   Hi all,

  I am trying to write an Amazon Kinesis consumer Scala app that
 processes data in the
 Kinesis stream. Is this the correct way to specify *build.sbt*:

  ---
  import AssemblyKeys._

  name := "Kinesis Consumer"
  version := "1.0"
  organization := "com.myconsumer"
  scalaVersion := "2.11.5"

  libraryDependencies ++= Seq(
    "org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
    "org.apache.spark" % "spark-streaming_2.10" % "1.3.0",
    "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")

  assemblySettings
  jarName in assembly := "consumer-assembly.jar"
  assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
 

  In *project/assembly.sbt* I have only the following line:

  addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

  I am using sbt 0.13.7. I adapted Example 7.7 in the Learning Spark
 book.

  Thanks,
 Vadim






Re: Spark Streaming FileStream Nested File Support

2015-04-03 Thread Tathagata Das
A sort-of-hacky workaround is to use a queueStream where you can manually
create RDDs (using sparkContext.hadoopFile) and insert them into the queue. Note
that this is for testing only, as queueStream does not work with driver
fault recovery.

TD
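
A rough sketch of that workaround (paths are made up; sc.textFile stands in for
sparkContext.hadoopFile, and an existing StreamingContext ssc is assumed):

  import scala.collection.mutable
  import org.apache.spark.rdd.RDD

  // Testing only: queueStream does not work with driver fault recovery.
  val rddQueue = new mutable.Queue[RDD[String]]()
  val logs = ssc.queueStream(rddQueue)

  logs.foreachRDD { rdd => println(s"batch record count = ${rdd.count()}") }

  ssc.start()

  // Manually build one RDD per nested day directory and push it into the queue.
  for (day <- Seq("2015/03/01", "2015/03/02")) {
    rddQueue += ssc.sparkContext.textFile(s"s3n://mybucket/$day/*log")
  }

  ssc.awaitTermination()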

On Fri, Apr 3, 2015 at 12:23 PM, adamgerst adamge...@gmail.com wrote:

 So after pulling my hair out for a bit trying to convert one of my standard
 spark jobs to streaming I found that FileInputDStream does not support
 nested folders (see the brief mention here

 http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
 the fileStream method returns a FileInputDStream).  So before, for my
 standard job, I was reading from say

 s3n://mybucket/2015/03/02/*log

 And could also modify it to simply get an entire months worth of logs.
 Since the logs are split up based upon their date, when the batch ran for
 the day, I simply passed in a parameter of the date to make sure I was
 reading the correct data

 But since I want to turn this job into a streaming job I need to simply do
 something like

 s3n://mybucket/*log

 This would totally work fine if it were a standard spark application, but
 fails for streaming.  Is there anyway I can get around this limitation?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-FileStream-Nested-File-Support-tp22370.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: spark mesos deployment : starting workers based on attributes

2015-04-03 Thread Tim Chen
Hi Ankur,

There isn't a way to do that yet, but it's simple to add.

Can you create a JIRA in Spark for this?

Thanks!

Tim

On Fri, Apr 3, 2015 at 1:08 PM, Ankur Chauhan achau...@brightcove.com
wrote:

 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 Hi,

 I am trying to figure out if there is a way to tell the mesos
 scheduler in spark to isolate the workers to a set of mesos slaves
 that have a given attribute such as `tachyon:true`.

 Anyone knows if that is possible or how I could achieve such a behavior.

 Thanks!
 - -- Ankur Chauhan
 -BEGIN PGP SIGNATURE-

 iQEcBAEBAgAGBQJVHvMlAAoJEOSJAMhvLp3LaV0H/jtX+KQDyorUESLIKIxFV9KM
 QjyPtVquwuZYcwLqCfQbo62RgE/LeTjjxzifTzMM5D6cf4ULBH1TcS3Is2EdOhSm
 UTMfJyvK06VFvYMLiGjqN4sBG3DFdamQif18qUJoKXX/Z9cUQO9SaSjIezSq2gd8
 0lM3NLEQjsXY5uRJyl9GYDxcFsXPVzt1crXAdrtVsIYAlFmhcrm1n/5+Peix89Oh
 vgK1J7e0ei7Rc4/3BR2xr8f9us+Jfqym/xe+45h1YYZxZWrteCa48NOGixuUJjJe
 zb1MxNrTFZhPrKFT7pz9kCUZXl7DW5hzoQCH07CXZZI3B7kFS+5rjuEIB9qZXPE=
 =cadl
 -END PGP SIGNATURE-

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: MLlib: save models to HDFS?

2015-04-03 Thread Justin Yip
Hello Zhou,

You can look at the recommendation template
http://templates.prediction.io/PredictionIO/template-scala-parallel-recommendation
of PredictionIO. PredictionIO is built on top of Spark, and this
template illustrates how you can save the ALS model to HDFS and then reload
it later.

Justin


On Fri, Apr 3, 2015 at 9:16 AM, S. Zhou myx...@yahoo.com.invalid wrote:

 I am new to MLib so I have a basic question: is it possible to save MLlib
 models (particularly CF models) to HDFS and then reload it later? If yes,
 could u share some sample code (I could not find it in MLlib tutorial).
 Thanks!



Spark TeraSort source request

2015-04-03 Thread Tom
Hi all,

As we all know, Spark has set the record for sorting data, as published on:
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html.

Here at our group, we would love to verify these results and compare
machines using this benchmark. We've spent quite some time trying to find the
TeraSort source code that was used, but cannot find it anywhere.

We did find two candidates: 

A version posted by Reynold [1], the poster of the message above. This
version is stuck at "// TODO: Add partition-local (external) sorting
using TeraSortRecordOrdering", and only generates data.

Here, Ewan noticed that it didn't appear to be similar to Hadoop TeraSort.
[2] After this he created a version on his own [3]. With this version, we
noticed problems with TeraValidate with datasets above ~10G (as mentioned by
others at [4]). When examining the raw input and output files, it actually
appears that the input data is sorted and the output data unsorted in both
cases. 

Because of this, we believe we did not yet find the actual used source code.
I've tried searching the Spark user forum archives and saw requests from other
people indicating a demand, but did not succeed in finding the actual
source code.

My question:
Could you guys please make the source code of the used TeraSort program,
preferably with settings, available? If not, what are the reasons that this
seems to be withheld?

Thanks for any help,

Tom Hubregtsen 

[1]
https://github.com/rxin/spark/commit/adcae69145905162fa3b6932f70be2c932f95f87
[2]
http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3c5462092c.1060...@ugent.be%3E
[3] https://github.com/ehiggs/spark-terasort
[4]
http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAPszQwgap4o1inZkTwcwV=7scwoqtr5yxfnsqo5p2kgp1bn...@mail.gmail.com%3E



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-TeraSort-source-request-tp22371.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Regarding MLLIB sparse and dense matrix

2015-04-03 Thread Joseph Bradley
If you can examine your data matrix and know that about  1/6 or so of the
values are non-zero (so  5/6 are zeros), then it's probably worth using
sparse vectors.  (1/6 is a rough estimate.)

There is support for L1 and L2 regularization.  You can look at the guide
here:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#logistic-regression
and the API docs linked from the menu.
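
To make both points concrete, a small sketch (feature values are made up; the
regularization is set through the optimizer of LogisticRegressionWithSGD):

  import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.optimization.{L1Updater, SquaredL2Updater}
  import org.apache.spark.mllib.regression.LabeledPoint

  // Dense: every slot is stored, so a missing/null score must be encoded explicitly (e.g. 0.0).
  val dense = Vectors.dense(0.9, 0.0, 0.75)

  // Sparse: only the non-zero entries are stored (size, indices, values).
  val sparse = Vectors.sparse(3, Array(0, 2), Array(0.9, 0.75))

  val training = sc.parallelize(Seq(LabeledPoint(1.0, dense), LabeledPoint(0.0, sparse)))

  // L2-regularized logistic regression; swap in new L1Updater() for L1.
  val lr = new LogisticRegressionWithSGD()
  lr.optimizer
    .setNumIterations(100)
    .setRegParam(0.1)
    .setUpdater(new SquaredL2Updater())

  val model = lr.run(training)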

On Fri, Apr 3, 2015 at 1:24 PM, Jeetendra Gangele gangele...@gmail.com
wrote:

 Hi All
 I am building a logistic regression for matching person data. Let's say
 two person objects are given with their attributes and we need to find a score;
 that means on one side you have 10 million records and on the other side we have 1
 record, and we need to tell which one matches with the highest score among the 1 million.

 I am storing the scores of the similarity algorithms in a dense matrix and considering
 this as features. We will apply many similarity algorithms to each attribute.

 Should I use sparse or dense vectors? What happens with dense vectors when a score is null or
 when some of the attributes are missing?

 Is there any support for regularized logistic regression? Currently I am
 using LogisticRegressionWithSGD.

 Regards
 jeetendra



Migrating from Spark 0.8.0 to Spark 1.3.0

2015-04-03 Thread Ritesh Kumar Singh
Hi,

Are there any tutorials that explain all the changes between Spark
0.8.0 and Spark 1.3.0, and how should we approach this migration?


Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
Thanks Mohammed,

I was aware of Calliope, but haven't used it since the
spark-cassandra-connector project got released.  I was not aware of the
CalliopeServer2; cool thanks for sharing that one.

I would appreciate it if you could lmk how you decide to proceed with this;
I can see this coming up on my radar in the next few months; thanks.

-Todd

On Fri, Apr 3, 2015 at 5:53 PM, Mohammed Guller moham...@glassbeam.com
wrote:

  Thanks, Todd.



 It is an interesting idea; worth trying.



 I think the cash project is old. The tuplejump guy has created another
 project called CalliopeServer2, which works like a charm with BI tools that
 use JDBC, but unfortunately Tableau throws an error when it connects to it.



 Mohammed



 *From:* Todd Nist [mailto:tsind...@gmail.com]
 *Sent:* Friday, April 3, 2015 11:39 AM
 *To:* pawan kumar
 *Cc:* Mohammed Guller; user@spark.apache.org

 *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra



 Hi Mohammed,



 Not sure if you have tried this or not.  You could try using the below api
 to start the thriftserver with an existing context.


 https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42

 The one thing that Michael Armbrust @ Databricks recommended was this:

 You can start a JDBC server with an existing context.  See my answer here:
 http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html

 So something like this based on example from Cheng Lian:


 * Server*

 import  org.apache.spark.sql.hive.HiveContext

 import  org.apache.spark.sql.catalyst.types._



 val  sparkContext  =  sc

 import  sparkContext._

 val  sqlContext  =  new  HiveContext(sparkContext)

 import  sqlContext._

 makeRDD((1, "hello") :: (2, "world") :: Nil).toSchemaRDD.cache().registerTempTable("t")

 // replace the above with the C* + spark-cassandra-connector to generate a
 // SchemaRDD and registerTempTable



 import  org.apache.spark.sql.hive.thriftserver._

 HiveThriftServer2.startWithContext(sqlContext)

   Then Startup

 ./bin/beeline -u jdbc:hive2://localhost:1/default

 0: jdbc:hive2://localhost:1/default select * from t;



   I have not tried this yet from Tableau.   My understanding is that the
 tempTable is only valid as long as the sqlContext is, so if one terminates
 the code representing the *Server*, and then restarts the standard thrift
 server, sbin/start-thriftserver ..., the table won't be available.



 Another possibility is to perhaps use the tuplejump cash project,
 https://github.com/tuplejump/cash.



 HTH.



 -Todd



 On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar pkv...@gmail.com wrote:

 Thanks mohammed. Will give it a try today. We would also need the
 sparksSQL piece as we are migrating our data store from oracle to C* and it
 would be easier to maintain all the reports rather recreating each one from
 scratch.

 Thanks,
 Pawan Venugopal.

 On Apr 3, 2015 7:59 AM, Mohammed Guller moham...@glassbeam.com wrote:

 Hi Todd,



 We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly
 with C* using the ODBC driver, but now would like to add Spark SQL to the
 mix. I haven’t been able to find any documentation for how to make this
 combination work.



 We are using the Spark-Cassandra-Connector in our applications, but
 haven’t been able to figure out how to get the Spark SQL Thrift Server to
 use it and connect to C*. That is the missing piece. Once we solve that
 piece of the puzzle then Tableau should be able to see the tables in C*.



 Hi Pawan,

 Tableau + C* is pretty straight forward, especially if you are using DSE.
 Create a new DSN in Tableau using the ODBC driver that comes with DSE. Once
 you connect, Tableau allows to use C* keyspace as schema and column
 families as tables.



 Mohammed



 *From:* pawan kumar [mailto:pkv...@gmail.com]
 *Sent:* Friday, April 3, 2015 7:41 AM
 *To:* Todd Nist
 *Cc:* user@spark.apache.org; Mohammed Guller
 *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra



 Hi Todd,

 Thanks for the link. I would be interested in this solution. I am using
 DSE for cassandra. Would you provide me with info on connecting with DSE
 either through Tableau or zeppelin. The goal here is query cassandra
 through spark sql so that I could perform joins and groupby on my queries.
 Are you able to perform spark sql queries with tableau?

 Thanks,
 Pawan Venugopal

 On Apr 3, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote:

 What version of Cassandra are you using?  Are you using DSE or the stock
 Apache Cassandra version?  I have connected it with DSE, but have not
 attempted it with the standard Apache Cassandra version.



 FWIW,
 http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
 provides an ODBC driver tor accessing C* from Tableau.  Granted it does not
 provide all the goodness of Spark.  Are you attempting to 

Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Denny Lee
 Sweet - I'll have to play with this then! :)
On Fri, Apr 3, 2015 at 19:43 Reynold Xin r...@databricks.com wrote:

 There is already an explode function on DataFrame btw


 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L712

 I think something like this would work. You might need to play with the
 type.

 df.explode("arrayBufferColumn") { x => x }



 On Fri, Apr 3, 2015 at 6:43 AM, Denny Lee denny.g@gmail.com wrote:

 Thanks Dean - fun hack :)

 On Fri, Apr 3, 2015 at 6:11 AM Dean Wampler deanwamp...@gmail.com
 wrote:

 A hack workaround is to use flatMap:

  rdd.flatMap { case (date, array) => for (x <- array) yield (date, x) }

 For those of you who don't know Scala, the for comprehension iterates
 through the ArrayBuffer, named array and yields new tuples with the date
  and each element. The case expression to the left of the => pattern matches
 on the input tuples.
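
  For completeness, a rough sketch of the same trick applied directly to the DataFrame's
  row RDD (the column types below are assumptions about the frame in question):

    import org.apache.spark.sql.Row

    // Turn [2015-04, ArrayBuffer(A, B, C, D)] rows into one (month, value) pair per element.
    val exploded = df.rdd.flatMap {
      case Row(month: String, values: Seq[_]) => values.map(v => (month, v.toString))
    }
    exploded.collect().foreach(println)  // (2015-04,A), (2015-04,B), ...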

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Thu, Apr 2, 2015 at 10:45 PM, Denny Lee denny.g@gmail.com
 wrote:

 Thanks Michael - that was it!  I was drawing a blank on this one for
 some reason - much appreciated!


 On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust mich...@databricks.com
 wrote:

 A lateral view explode using HiveQL.  I'm hopping to add explode
 shorthand directly to the df API in 1.4.

 On Thu, Apr 2, 2015 at 7:10 PM, Denny Lee denny.g@gmail.com
 wrote:

 Quick question - the output of a dataframe is in the format of:

 [2015-04, ArrayBuffer(A, B, C, D)]

 and I'd like to return it as:

 2015-04, A
 2015-04, B
 2015-04, C
 2015-04, D

 What's the best way to do this?

 Thanks in advance!








Re: kmeans|| in Spark is not real paralleled?

2015-04-03 Thread Xi Shen
Hi Xiangrui,

I have create JIRA https://issues.apache.org/jira/browse/SPARK-6706, and
attached the sample code. But I could not attache the test data. I will
update the bug once I found a place to host the test data.


Thanks,
David


On Tue, Mar 31, 2015 at 8:18 AM Xiangrui Meng men...@gmail.com wrote:

 This PR updated the k-means|| initialization:
 https://github.com/apache/spark/commit/ca7910d6dd7693be2a675a0d6a6fcc9eb0aaeb5d,
 which was included in 1.3.0. It should fix kmean|| initialization with
 large k. Please create a JIRA for this issue and send me the code and the
 dataset to produce this problem. Thanks! -Xiangrui

 On Sun, Mar 29, 2015 at 1:20 AM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 I have opened a couple of threads asking about k-means performance
 problem in Spark. I think I made a little progress.

 Previous I use the simplest way of KMeans.train(rdd, k, maxIterations).
 It uses the kmeans|| initialization algorithm which supposedly to be a
 faster version of kmeans++ and give better results in general.

 But I observed that if the k is very large, the initialization step takes
 a long time. From the CPU utilization chart, it looks like only one thread
 is working. Please see
 https://stackoverflow.com/questions/29326433/cpu-gap-when-doing-k-means-with-spark
 .

 I read the paper,
 http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf, and it
 points out kmeans++ initialization algorithm will suffer if k is large.
 That's why the paper contributed the kmeans|| algorithm.


 If I invoke KMeans.train using the random initialization algorithm, I
 do not observe this problem, even with a very large k, like k=5000. This
 makes me suspect that the kmeans|| in Spark is not properly implemented and
 does not utilize a parallel implementation.
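
 For reference, a minimal sketch of the two initialization modes being compared, using the
 KMeans.train overload that takes an initializationMode string (the data here is synthetic,
 just to make the sketch self-contained; k is arbitrary):

   import org.apache.spark.mllib.clustering.KMeans
   import org.apache.spark.mllib.linalg.Vectors
   import scala.util.Random

   // Synthetic data: 100k random 10-dimensional points.
   val data = sc.parallelize(1 to 100000)
     .map(_ => Vectors.dense(Array.fill(10)(Random.nextDouble())))
     .cache()

   val k = 5000
   val maxIterations = 20
   val runs = 1

   // Default initialization (k-means||), the step that appears single-threaded for large k:
   val modelParallelInit = KMeans.train(data, k, maxIterations, runs, KMeans.K_MEANS_PARALLEL)

   // Random initialization, which does not show the long single-threaded gap:
   val modelRandomInit = KMeans.train(data, k, maxIterations, runs, KMeans.RANDOM)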


 I have also tested my code and data set with Spark 1.3.0, and I still
 observe this problem. I quickly checked the PR regarding the KMeans
 algorithm change from 1.2.0 to 1.3.0. It seems to be only code improvement
 and polish, not changing/improving the algorithm.


 I originally worked on Windows 64bit environment, and I also tested on
 Linux 64bit environment. I could provide the code and data set if anyone
 want to reproduce this problem.


 I hope a Spark developer could comment on this problem and help
 identifying if it is a bug.


 Thanks,

 Xi Shen
 about.me/davidshen





Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Reynold Xin
There is already an explode function on DataFrame btw

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L712

I think something like this would work. You might need to play with the
type.

df.explode("arrayBufferColumn") { x => x }



On Fri, Apr 3, 2015 at 6:43 AM, Denny Lee denny.g@gmail.com wrote:

 Thanks Dean - fun hack :)

 On Fri, Apr 3, 2015 at 6:11 AM Dean Wampler deanwamp...@gmail.com wrote:

 A hack workaround is to use flatMap:

 rdd.flatMap { case (date, array) => for (x <- array) yield (date, x) }

 For those of you who don't know Scala, the for comprehension iterates
 through the ArrayBuffer, named array and yields new tuples with the date
 and each element. The case expression to the left of the => pattern matches
 on the input tuples.

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Thu, Apr 2, 2015 at 10:45 PM, Denny Lee denny.g@gmail.com wrote:

 Thanks Michael - that was it!  I was drawing a blank on this one for
 some reason - much appreciated!


 On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust mich...@databricks.com
 wrote:

 A lateral view explode using HiveQL.  I'm hopping to add explode
 shorthand directly to the df API in 1.4.

 On Thu, Apr 2, 2015 at 7:10 PM, Denny Lee denny.g@gmail.com
 wrote:

 Quick question - the output of a dataframe is in the format of:

 [2015-04, ArrayBuffer(A, B, C, D)]

 and I'd like to return it as:

 2015-04, A
 2015-04, B
 2015-04, C
 2015-04, D

 What's the best way to do this?

 Thanks in advance!







Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Denny Lee
Thanks Dean - fun hack :)

On Fri, Apr 3, 2015 at 6:11 AM Dean Wampler deanwamp...@gmail.com wrote:

 A hack workaround is to use flatMap:

 rdd.flatMap { case (date, array) => for (x <- array) yield (date, x) }

 For those of you who don't know Scala, the for comprehension iterates
 through the ArrayBuffer, named array and yields new tuples with the date
 and each element. The case expression to the left of the => pattern matches
 on the input tuples.

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Thu, Apr 2, 2015 at 10:45 PM, Denny Lee denny.g@gmail.com wrote:

 Thanks Michael - that was it!  I was drawing a blank on this one for some
 reason - much appreciated!


 On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust mich...@databricks.com
 wrote:

 A lateral view explode using HiveQL.  I'm hopping to add explode
 shorthand directly to the df API in 1.4.

 On Thu, Apr 2, 2015 at 7:10 PM, Denny Lee denny.g@gmail.com wrote:

 Quick question - the output of a dataframe is in the format of:

 [2015-04, ArrayBuffer(A, B, C, D)]

 and I'd like to return it as:

 2015-04, A
 2015-04, B
 2015-04, C
 2015-04, D

 What's the best way to do this?

 Thanks in advance!







Re: Reading a large file (binary) into RDD

2015-04-03 Thread Dean Wampler
This might be overkill for your needs, but the scodec parser combinator
library might be useful for creating a parser.

https://github.com/scodec/scodec

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Thu, Apr 2, 2015 at 6:53 PM, java8964 java8...@hotmail.com wrote:

 I think implementing your own InputFormat and using
 SparkContext.hadoopFile() is the best option for your case.

 Yong

 --
 From: kvi...@vt.edu
 Date: Thu, 2 Apr 2015 17:31:30 -0400
 Subject: Re: Reading a large file (binary) into RDD
 To: freeman.jer...@gmail.com
 CC: user@spark.apache.org


 The file has a specific structure. I outline it below.

 The input file is basically a representation of a graph.

 INT
 INT(A)
 LONG (B)
 A INTs(Degrees)
 A SHORTINTs  (Vertex_Attribute)
 B INTs
 B INTs
 B SHORTINTs
 B SHORTINTs

 A - number of vertices
 B - number of edges (note that the INTs/SHORTINTs associated with this are
 edge attributes)

 After reading in the file, I need to create two RDDs (one with vertices
 and the other with edges)

 On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman freeman.jer...@gmail.com
 wrote:

 Hm, that will indeed be trickier because this method assumes records are
 the same byte size. Is the file an arbitrary sequence of mixed types, or is
 there structure, e.g. short, long, short, long, etc.?

 If you could post a gist with an example of the kind of file and how it
 should look once read in that would be useful!

 -
 jeremyfreeman.net
 @thefreemanlab

 On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:

 Thanks for the reply. Unfortunately, in my case, the binary file is a mix
 of short and long integers. Is there any other way that could of use here?

 My current method happens to have a large overhead (much more than actual
 computation time). Also, I am short of memory at the driver when it has to
 read the entire file.

 On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com
 wrote:

 If it’s a flat binary file and each record is the same length (in bytes),
 you can use Spark’s binaryRecords method (defined on the SparkContext),
 which loads records from one or more large flat binary files into an RDD.
 Here’s an example in python to show how it works:

 # write data from an array
 from numpy import random
 dat = random.randn(100,5)
 f = open('test.bin', 'w')
 f.write(dat)
 f.close()


 # load the data back in

 from numpy import frombuffer

 nrecords = 5
 bytesize = 8
 recordsize = nrecords * bytesize
 data = sc.binaryRecords('test.bin', recordsize)
 parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))


 # these should be equal
 parsed.first()
 dat[0,:]


 Does that help?

 -
 jeremyfreeman.net
 @thefreemanlab
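
 For the Scala side, the fixed-record-size variant looks roughly like this (the 5-double
 record layout, the little-endian byte order, and the path are assumptions for illustration;
 it does not handle the mixed short/long layout described above):

   import java.nio.{ByteBuffer, ByteOrder}

   // Each record: 5 doubles = 40 bytes (assumption matching the numpy example above).
   val recordSize = 5 * 8
   val records = sc.binaryRecords("hdfs:///data/test.bin", recordSize)

   val parsed = records.map { bytes =>
     val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
     Array.fill(5)(buf.getDouble)
   }

   parsed.first()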

 On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:

 What are some efficient ways to read a large file into RDDs?

 For example, have several executors read a specific/unique portion of the
 file and construct RDDs. Is this possible to do in Spark?

 Currently, I am doing a line-by-line read of the file at the driver and
 constructing the RDD.








Spark unit test fails

2015-04-03 Thread Manas Kar
Hi experts,
 I am trying to write unit tests for my Spark application, but they fail with a
javax.servlet.FilterRegistration error.

I am using CDH 5.3.2 Spark, and below is my dependency list.

val spark   = "1.2.0-cdh5.3.2"
val esriGeometryAPI = "1.2"
val csvWriter   = "1.0.0"
val hadoopClient= "2.3.0"
val scalaTest   = "2.2.1"
val jodaTime= "1.6.0"
val scalajHTTP  = "1.0.1"
val avro= "1.7.7"
val scopt   = "3.2.0"
val config  = "1.2.1"
val jobserver   = "0.4.1"
val excludeJBossNetty = ExclusionRule(organization = "org.jboss.netty")
val excludeIONetty = ExclusionRule(organization = "io.netty")
val excludeEclipseJetty = ExclusionRule(organization = "org.eclipse.jetty")
val excludeMortbayJetty = ExclusionRule(organization = "org.mortbay.jetty")
val excludeAsm = ExclusionRule(organization = "org.ow2.asm")
val excludeOldAsm = ExclusionRule(organization = "asm")
val excludeCommonsLogging = ExclusionRule(organization = "commons-logging")
val excludeSLF4J = ExclusionRule(organization = "org.slf4j")
val excludeScalap = ExclusionRule(organization = "org.scala-lang", artifact = "scalap")
val excludeHadoop = ExclusionRule(organization = "org.apache.hadoop")
val excludeCurator = ExclusionRule(organization = "org.apache.curator")
val excludePowermock = ExclusionRule(organization = "org.powermock")
val excludeFastutil = ExclusionRule(organization = "it.unimi.dsi")
val excludeJruby = ExclusionRule(organization = "org.jruby")
val excludeThrift = ExclusionRule(organization = "org.apache.thrift")
val excludeServletApi = ExclusionRule(organization = "javax.servlet", artifact = "servlet-api")
val excludeJUnit = ExclusionRule(organization = "junit")

I found this link (
http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-SecurityException-when-running-tests-with-Spark-1-0-0-td6747.html#a6749
) discussing the issue and a workaround for it, but that workaround does not
get rid of the problem for me. I am using an SBT build, which can't be changed
to Maven.

What am I missing?


Stack trace
-
[info] FiltersRDDSpec:
[info] - Spark Filter *** FAILED ***
[info]   java.lang.SecurityException: class
javax.servlet.FilterRegistration's signer information does not match
signer information of other classes in the same package
[info]   at java.lang.ClassLoader.checkCerts(Unknown Source)
[info]   at java.lang.ClassLoader.preDefineClass(Unknown Source)
[info]   at java.lang.ClassLoader.defineClass(Unknown Source)
[info]   at java.security.SecureClassLoader.defineClass(Unknown Source)
[info]   at java.net.URLClassLoader.defineClass(Unknown Source)
[info]   at java.net.URLClassLoader.access$100(Unknown Source)
[info]   at java.net.URLClassLoader$1.run(Unknown Source)
[info]   at java.net.URLClassLoader$1.run(Unknown Source)
[info]   at java.security.AccessController.doPrivileged(Native Method)
[info]   at java.net.URLClassLoader.findClass(Unknown Source)

Thanks
Manas
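A commonly suggested direction for this particular signer-information error is
to make sure the conflicting servlet-api/Jetty artifacts are actually excluded
from the Spark and Hadoop dependencies, and to run the tests in a forked JVM.
A hedged SBT sketch reusing the exclusion rules listed above (the artifact
coordinates are assumptions based on the versions in the question):

libraryDependencies ++= Seq(
  // exclude the old servlet-api/Jetty classes that clash with the signed ones
  "org.apache.spark" %% "spark-core" % spark
    excludeAll(excludeEclipseJetty, excludeServletApi),
  "org.apache.hadoop" % "hadoop-client" % hadoopClient
    excludeAll(excludeMortbayJetty, excludeServletApi, excludeAsm, excludeOldAsm)
)

// forking keeps SBT's own classpath out of the test classloader,
// which is another commonly suggested workaround for this error
fork in Test := true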


Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Frank Austin Nothaft
You’ll definitely want to use a Kryo-based serializer for Avro. We have a
Kryo-based serializer that wraps the efficient Avro serializer here.
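If you only need to get past the NotSerializableException, a minimal
configuration-only sketch (not the wrapper Frank refers to; the registrator
name below is illustrative) is to switch Spark to Kryo and register the
generated class, since Kryo does not require java.io.Serializable:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// registers the Avro-generated class so Kryo handles it with its default
// field serializer; a chill-avro style serializer could be plugged in instead
class AvroKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[AvroKryoRegistrator].getName)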

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466

On Apr 3, 2015, at 5:41 AM, Akhil Das ak...@sigmoidanalytics.com wrote:

 Because it's throwing serialization exceptions, and Kryo is a serializer that
 can serialize your objects.
 
 Thanks
 Best Regards
 
 On Fri, Apr 3, 2015 at 5:37 PM, Deepak Jain deepuj...@gmail.com wrote:
 I meant that I did not have to use Kryo. Why will Kryo help fix this issue
 now?
 
 Sent from my iPhone
 
 On 03-Apr-2015, at 5:36 pm, Deepak Jain deepuj...@gmail.com wrote:
 
 I was able to write a record that extends SpecificRecord (Avro); that class
 was not auto-generated. Do we need to do something extra for auto-generated
 classes?
 
 Sent from my iPhone
 
 On 03-Apr-2015, at 5:06 pm, Akhil Das ak...@sigmoidanalytics.com wrote:
 
 This thread might give you some insights 
 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3CCA+WVT8WXbEHac=N0GWxj-s9gqOkgG0VRL5B=ovjwexqm8ev...@mail.gmail.com%3E
 
 Thanks
 Best Regards
 
 On Fri, Apr 3, 2015 at 3:53 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
 My Spark Job failed with
 
 
 15/04/03 03:15:36 INFO scheduler.DAGScheduler: Job 0 failed: 
 saveAsNewAPIHadoopFile at AbstractInputHelper.scala:103, took 2.480175 s
 15/04/03 03:15:36 ERROR yarn.ApplicationMaster: User class threw exception: 
 Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 0) had a not 
 serializable result: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
 - object not serializable (class: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value: 
 {userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt: 
 null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
 - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
 - object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0, 
 spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt: 
 null, currPsLvlId: null}))
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 
 in stage 2.0 (TID 0) had a not serializable result: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
 - object not serializable (class: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value: 
 {userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt: 
 null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
 - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
 - object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0, 
 spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt: 
 null, currPsLvlId: null}))
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at scala.Option.foreach(Option.scala:236)
 at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 15/04/03 03:15:36 INFO yarn.ApplicationMaster: Final app status: FAILED, 
 exitCode: 15, (reason: User class threw exception: Job aborted due to stage 
 failure: Task 0.0 in stage 2.0 (TID 0) had a not serializable result: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
 
 
 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum is
 auto-generated from an Avro schema using the Avro generate-sources Maven plugin.
 
 
 package com.ebay.ep.poc.spark.reporting.process.model.dw;  
 
 @SuppressWarnings("all")
 
 @org.apache.avro.specific.AvroGenerated
 
 public class SpsLevelMetricSum extends 
 org.apache.avro.specific.SpecificRecordBase implements 
 org.apache.avro.specific.SpecificRecord {
 
 ...
 ...
 }
 
 Can anyone suggest how to fix this?
 
 
 
 -- 
 Deepak
 
 
 



Re: Reading a large file (binary) into RDD

2015-04-03 Thread Vijayasarathy Kannan
Thanks everyone for the inputs.

I guess I will try out a custom implementation of InputFormat. But I have
no idea where to start. Are there any code examples of this that might help?
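A minimal skeleton to start from might look roughly like the following (a
sketch using the newer org.apache.hadoop.mapreduce API; the class names and the
fixed 8-byte record size are placeholders, and for a header-structured file
like yours the header would have to be parsed in initialize() to work out the
real record boundaries):

import org.apache.hadoop.fs.{FSDataInputStream, Path}
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

// emits (byte offset, raw record bytes) pairs for fixed-length binary records
class FixedRecordInputFormat extends FileInputFormat[LongWritable, BytesWritable] {
  // keeping files unsplittable sidesteps records straddling split boundaries
  override def isSplitable(context: JobContext, file: Path): Boolean = false

  override def createRecordReader(split: InputSplit, context: TaskAttemptContext)
      : RecordReader[LongWritable, BytesWritable] =
    new FixedRecordReader(recordLength = 8) // record size is a placeholder
}

class FixedRecordReader(recordLength: Int)
    extends RecordReader[LongWritable, BytesWritable] {
  private var in: FSDataInputStream = _
  private var start = 0L
  private var pos = 0L
  private var end = 0L
  private val key = new LongWritable()
  private val value = new BytesWritable()

  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = {
    val fileSplit = split.asInstanceOf[FileSplit]
    val path = fileSplit.getPath
    val fs = path.getFileSystem(context.getConfiguration)
    in = fs.open(path)
    start = fileSplit.getStart
    end = start + fileSplit.getLength
    pos = start
    in.seek(start)
  }

  override def nextKeyValue(): Boolean = {
    if (pos + recordLength > end) return false
    val buf = new Array[Byte](recordLength)
    in.readFully(buf)                 // read exactly one record
    key.set(pos)
    value.set(buf, 0, recordLength)
    pos += recordLength
    true
  }

  override def getCurrentKey: LongWritable = key
  override def getCurrentValue: BytesWritable = value
  override def getProgress: Float =
    if (end == start) 1.0f else (pos - start).toFloat / (end - start)
  override def close(): Unit = if (in != null) in.close()
}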

On Fri, Apr 3, 2015 at 9:15 AM, Dean Wampler deanwamp...@gmail.com wrote:

 This might be overkill for your needs, but the scodec parser combinator
 library might be useful for creating a parser.

 https://github.com/scodec/scodec

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Thu, Apr 2, 2015 at 6:53 PM, java8964 java8...@hotmail.com wrote:

 I think implementing your own InputFormat and using
 SparkContext.hadoopFile() is the best option for your case.

 Yong

 --
 From: kvi...@vt.edu
 Date: Thu, 2 Apr 2015 17:31:30 -0400
 Subject: Re: Reading a large file (binary) into RDD
 To: freeman.jer...@gmail.com
 CC: user@spark.apache.org


 The file has a specific structure. I outline it below.

 The input file is basically a representation of a graph.

 INT
 INT(A)
 LONG (B)
 A INTs (Degrees)
 A SHORTINTs  (Vertex_Attribute)
 B INTs
 B INTs
 B SHORTINTs
 B SHORTINTs

 A - number of vertices
 B - number of edges (note that the INTs/SHORTINTs associated with this
 are edge attributes)

 After reading in the file, I need to create two RDDs (one with vertices
 and the other with edges)
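Given that layout, one way to avoid reading the whole file at the driver is to
let an executor parse each file via sc.binaryFiles. A sketch under assumptions
(big-endian layout, the two blocks of B INTs being edge sources and
destinations, and each individual file fitting in a single executor's memory;
the names below are illustrative):

import java.nio.ByteBuffer

val parsedGraphs = sc.binaryFiles("hdfs:///path/to/graph.bin").map { case (_, stream) =>
  val buf = ByteBuffer.wrap(stream.toArray())   // ByteBuffer defaults to big-endian
  val header      = buf.getInt()                // leading INT
  val numVertices = buf.getInt()                // A
  val numEdges    = buf.getLong().toInt         // B
  val degrees = Array.fill(numVertices)(buf.getInt())
  val vAttrs  = Array.fill(numVertices)(buf.getShort())
  val src     = Array.fill(numEdges)(buf.getInt())
  val dst     = Array.fill(numEdges)(buf.getInt())
  val eAttr1  = Array.fill(numEdges)(buf.getShort())
  val eAttr2  = Array.fill(numEdges)(buf.getShort())
  val vertices = (0 until numVertices).map(i => (i.toLong, (degrees(i), vAttrs(i))))
  val edges    = (0 until numEdges).map(i => (src(i), dst(i), eAttr1(i), eAttr2(i)))
  (vertices, edges)
}

// the two RDDs asked for
val vertexRDD = parsedGraphs.flatMap(_._1)
val edgeRDD   = parsedGraphs.flatMap(_._2)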

 On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman freeman.jer...@gmail.com
 wrote:

 Hm, that will indeed be trickier because this method assumes records are
 the same byte size. Is the file an arbitrary sequence of mixed types, or is
 there structure, e.g. short, long, short, long, etc.?

 If you could post a gist with an example of the kind of file and how it
 should look once read in that would be useful!

 -
 jeremyfreeman.net
 @thefreemanlab

 On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:

 Thanks for the reply. Unfortunately, in my case, the binary file is a mix
 of short and long integers. Is there any other way that could be of use here?

 My current method happens to have a large overhead (much more than actual
 computation time). Also, I am short of memory at the driver when it has to
 read the entire file.

 On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com
 wrote:

 If it’s a flat binary file and each record is the same length (in bytes),
 you can use Spark’s binaryRecords method (defined on the SparkContext),
 which loads records from one or more large flat binary files into an RDD.
 Here’s an example in python to show how it works:

 # write data from an array
 from numpy import random
 dat = random.randn(100,5)
 f = open('test.bin', 'w')
 f.write(dat)
 f.close()


 # load the data back in

 from numpy import frombuffer

 nrecords = 5
 bytesize = 8
 recordsize = nrecords * bytesize
 data = sc.binaryRecords('test.bin', recordsize)
 parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))


 # these should be equal
 parsed.first()
 dat[0,:]


 Does that help?

 -
 jeremyfreeman.net
 @thefreemanlab

 On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:

 What are some efficient ways to read a large file into RDDs?

 For example, have several executors read a specific/unique portion of the
 file and construct RDDs. Is this possible to do in Spark?

 Currently, I am doing a line-by-line read of the file at the driver and
 constructing the RDD.









Spark Memory Utilities

2015-04-03 Thread Stephen Carman
I noticed Spark has some nice memory-tracking estimators in it, but they are
private. We have some custom implementations of RDD and PairRDD to suit our
internal needs, and it would be fantastic if we were able to just leverage the
memory estimates that already exist in Spark.

Is there any chance they can be made public inside the library, or exposed
through some interface so that subclasses can make use of them?
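Until something like that lands, one workaround people use is to compile a tiny
accessor into the same package, since the estimator is package-private. A
sketch only, assuming the utility in question is
org.apache.spark.util.SizeEstimator with an estimate(AnyRef): Long method
(check your Spark version):

package org.apache.spark.util

// re-exposes the private[spark] estimator to code outside the Spark packages
object SizeEstimatorAccessor {
  def estimate(obj: AnyRef): Long = SizeEstimator.estimate(obj)
}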

Thanks,

Stephen Carman, M.S.
AI Engineer, Coldlight Solutions, LLC
Cell - 267 240 0363




RE: Reading a large file (binary) into RDD

2015-04-03 Thread java8964
Hadoop's TextInputFormat is a good starting point. It is not really that hard:
you just need to implement the logic to identify the record delimiter, and
think of a logical way to represent the key and value for your RecordReader.
Yong
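Wiring a custom format like that into Spark might then look roughly like this
(a sketch; FixedRecordInputFormat is the hypothetical skeleton sketched earlier
in this thread, and the path is a placeholder). One gotcha worth noting: Hadoop
reuses the Writable instances, so copy the bytes out before caching or
shuffling.

import org.apache.hadoop.io.{BytesWritable, LongWritable}

val raw = sc.newAPIHadoopFile(
  "hdfs:///data/graph.bin",
  classOf[FixedRecordInputFormat],
  classOf[LongWritable],
  classOf[BytesWritable])

// copy out of the reused Writables into plain Scala values
val records = raw.map { case (offset, bytes) =>
  (offset.get, java.util.Arrays.copyOfRange(bytes.getBytes, 0, bytes.getLength))
}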

From: kvi...@vt.edu
Date: Fri, 3 Apr 2015 11:41:13 -0400
Subject: Re: Reading a large file (binary) into RDD
To: deanwamp...@gmail.com
CC: java8...@hotmail.com; user@spark.apache.org

Thanks everyone for the inputs.
I guess I will try out a custom implementation of InputFormat. But I have no 
idea where to start. Are there any code examples of this that might help?
On Fri, Apr 3, 2015 at 9:15 AM, Dean Wampler deanwamp...@gmail.com wrote:
This might be overkill for your needs, but the scodec parser combinator
library might be useful for creating a parser.
https://github.com/scodec/scodec

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition (O'Reilly)
Typesafe
@deanwampler
http://polyglotprogramming.com

On Thu, Apr 2, 2015 at 6:53 PM, java8964 java8...@hotmail.com wrote:



I think implementing your own InputFormat and using SparkContext.hadoopFile() 
is the best option for your case.
Yong

From: kvi...@vt.edu
Date: Thu, 2 Apr 2015 17:31:30 -0400
Subject: Re: Reading a large file (binary) into RDD
To: freeman.jer...@gmail.com
CC: user@spark.apache.org

The file has a specific structure. I outline it below.
The input file is basically a representation of a graph.

INT
INT (A)
LONG (B)
A INTs (Degrees)
A SHORTINTs (Vertex_Attribute)
B INTs
B INTs
B SHORTINTs
B SHORTINTs

A - number of vertices
B - number of edges (note that the INTs/SHORTINTs associated with this are edge attributes)
After reading in the file, I need to create two RDDs (one with vertices and the 
other with edges)
On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:
Hm, that will indeed be trickier because this method assumes records are the 
same byte size. Is the file an arbitrary sequence of mixed types, or is there 
structure, e.g. short, long, short, long, etc.? 
If you could post a gist with an example of the kind of file and how it should 
look once read in that would be useful!


-
jeremyfreeman.net
@thefreemanlab



On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:
Thanks for the reply. Unfortunately, in my case, the binary file is a mix of 
short and long integers. Is there any other way that could be of use here?
My current method happens to have a large overhead (much more than actual 
computation time). Also, I am short of memory at the driver when it has to read 
the entire file.
On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:
If it’s a flat binary file and each record is the same length (in bytes), you 
can use Spark’s binaryRecords method (defined on the SparkContext), which loads 
records from one or more large flat binary files into an RDD. Here’s an example 
in python to show how it works:
# write data from an array
from numpy import random
dat = random.randn(100,5)
f = open('test.bin', 'w')
f.write(dat)
f.close()

# load the data back in
from numpy import frombuffer
nrecords = 5
bytesize = 8
recordsize = nrecords * bytesize
data = sc.binaryRecords('test.bin', recordsize)
parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))

# these should be equal
parsed.first()
dat[0,:]
Does that help?
-
jeremyfreeman.net
@thefreemanlab


On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:
What are some efficient ways to read a large file into RDDs?
For example, have several executors read a specific/unique portion of the file 
and construct RDDs. Is this possible to do in Spark?
Currently, I am doing a line-by-line read of the file at the driver and 
constructing the RDD.