Re: Connection pooling in spark jobs

2015-04-03 Thread Charles Feduke
Out of curiosity I wanted to see what JBoss supported in terms of
clustering and database connection pooling since its implementation should
suffice for your use case. I found:

*Note:* JBoss does not recommend using this feature on a production
environment. It requires accessing a connection pool remotely and this is
an anti-pattern as connections are not serializable. Besides, transaction
propagation is not supported and it could lead to connection leaks if the
remote clients are unreliable (i.e crashes, network failure). If you do
need to access a datasource remotely, JBoss recommends accessing it via a
remote session bean facade.[1]

You probably aren't worried about transactions; I gather from your use case
you are just pulling this data in a read only fashion. That being said
JBoss appears to have something.

The other thing to look for is whether or not a solution exists in Hadoop;
I can't find anything for JDBC connection pools over a cluster (just pools
local to a mapper which is similar to what Cody suggested earlier for Spark
and partitions).

If you were talking about a high volume web application then I'd believe
the extra effort for connection pooling [over the cluster] would be worth
it. Unless you're planning on executing several hundred parallel jobs, does
the small amount of overhead outweigh the time necessary to develop a
solution? (I'm guessing a solution doesn't exist because the pattern where
it would be an issue just isn't a common use case for Spark. I went down
this path - connection pooling - myself originally and found a single
connection per executor was fine for my needs. Local connection pools for
the partition as Cody said previously would also work for my use case.)
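For what it's worth, a rough sketch of the single-connection-per-executor approach
(the JDBC URL, credentials, table, and class names below are placeholders, and the
JDBC driver jar is assumed to be on the executor classpath):

import java.sql.{Connection, DriverManager}
import org.apache.spark.{SparkConf, SparkContext}

// A Scala object is initialized once per executor JVM, so the lazy val gives
// exactly one connection per executor, created on first use.
object ExecutorConnection {
  lazy val conn: Connection =
    DriverManager.getConnection("jdbc:postgresql://dbhost:5432/mydb", "user", "secret")
}

object SingleConnectionPerExecutor {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SingleConnectionPerExecutor"))
    sc.parallelize(1 to 1000).foreachPartition { ids =>
      val stmt = ExecutorConnection.conn.prepareStatement("SELECT name FROM things WHERE id = ?")
      ids.foreach { id =>
        stmt.setInt(1, id)
        val rs = stmt.executeQuery()
        while (rs.next()) { /* consume the row */ }
        rs.close()
      }
      stmt.close() // the connection itself stays open for the lifetime of the executor
    }
    sc.stop()
  }
}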

A local connection pool that was shared amongst all executors on a node
isn't a solution since different jobs execute under different JVMs even
when on the same worker node.[2]

1. https://developer.jboss.org/wiki/ConfigDataSources
2. http://spark.apache.org/docs/latest/cluster-overview.html



On Fri, Apr 3, 2015 at 1:39 AM Sateesh Kavuri sateesh.kav...@gmail.com
wrote:

 Each executor runs for about 5 secs, during which time the db connection can
 potentially be open. Each executor will have 1 connection open.
 Connection pooling surely has its advantages of performance and not
 hitting the db server for every open/close. The database in question is not
 just used by the spark jobs, but is shared by other systems, and so the
 spark jobs have to be better at managing the resources.

 I am not really looking for a db connections counter (I will let the db
 handle that part), but rather to have a pool of connections on the Spark end
 so that the connections can be reused across jobs


 On Fri, Apr 3, 2015 at 10:21 AM, Charles Feduke charles.fed...@gmail.com
 wrote:

 How long does each executor keep the connection open for? How many
 connections does each executor open?

 Are you certain that connection pooling is a performant and suitable
 solution? Are you running out of resources on the database server and
 cannot tolerate each executor having a single connection?

 If you need a solution that limits the number of open connections
 [resource starvation on the DB server] I think you'd have to fake it with a
 centralized counter of active connections, and logic within each executor
 that blocks when the counter is at a given threshold. If the counter is not
 at threshold, then an active connection can be created (after incrementing
 the shared counter). You could use something like ZooKeeper to store the
 counter value. This would have the overall effect of decreasing performance
 if your required number of connections outstrips the database's resources.

 On Fri, Apr 3, 2015 at 12:22 AM Sateesh Kavuri sateesh.kav...@gmail.com
 wrote:

 But this basically means that the pool is confined to the job (of a
 single app) in question, but is not sharable across multiple apps?
 The setup we have is a job server (the spark-jobserver) that creates
 jobs. Currently, we have each job opening and closing a connection to the
 database. What we would like to achieve is for each of the jobs to obtain a
 connection from a db pool

 Any directions on how this can be achieved?

 --
 Sateesh

 On Thu, Apr 2, 2015 at 7:00 PM, Cody Koeninger c...@koeninger.org
 wrote:

 Connection pools aren't serializable, so you generally need to set them
 up inside of a closure.  Doing that for every item is wasteful, so you
 typically want to use mapPartitions or foreachPartition

 rdd.mapPartitions { part =>
   setupPool
   part.map { ...



 See Design Patterns for using foreachRDD in
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams

 On Thu, Apr 2, 2015 at 7:52 AM, Sateesh Kavuri 
 sateesh.kav...@gmail.com wrote:

 Right, I am aware of how to use connection pooling with Oracle, but
 the specific question is how to use it in the context of Spark job
 execution
 On 2 Apr 2015 17:41, Ted Yu yuzhih...@gmail.com 

Re: Delaying failed task retries + giving failing tasks to different nodes

2015-04-03 Thread Akhil Das
I think these are the configurations you are looking for:

*spark.locality.wait*: Number of milliseconds to wait to launch a
data-local task before giving up and launching it on a less-local node. The
same wait will be used to step through multiple locality levels
(process-local, node-local, rack-local and then any). It is also possible
to customize the waiting time for each level by setting
spark.locality.wait.node, etc. You should increase this setting if your
tasks are long and see poor locality, but the default usually works well.

Read more here
https://spark.apache.org/docs/latest/configuration.html#scheduling
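For example, something along these lines in your SparkConf (the values here are
just placeholders to show where the settings go):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.locality.wait", "10000")       // ms to wait for a data-local slot before falling back
  .set("spark.locality.wait.node", "30000")  // optional per-level override
  .set("spark.task.maxFailures", "16")       // failures of any one task tolerated before the job fails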



Thanks
Best Regards

On Fri, Apr 3, 2015 at 8:41 AM, Stephen Merity step...@commoncrawl.org
wrote:

 Hi there,

 I've been using Spark for processing 33,000 gzipped files that contain
 billions of JSON records (the metadata [WAT] dataset from Common Crawl).
 I've hit a few issues and have not yet found the answers from the
 documentation / search. This may well just be me not finding the right
 pages though I promise I've attempted to RTFM thoroughly!

 Is there any way to (a) ensure retry attempts are done on different nodes
 and/or (b) ensure there's a delay between retrying a failing task (similar
 to spark.shuffle.io.retryWait)?

 Optimally when a task fails it should be given to different executors*.
 This is not the case that I've seen. With maxFailures set to 16, the task
 is handed back to the same executor 16 times, even though there are 89
 other nodes.

 The retry attempts are incredibly fast. The transient issue disappears
 quickly (DNS resolution fails to an Amazon bucket) but the 16 retry
 attempts take less than a second, all run on the same flaky node.

 For now I've set the maxFailures to an absurdly high number and that has
 worked around the issue -- the DNS error disappears on the specified
 machine after ~22 seconds (~360 task attempts) -- but that's obviously
 suboptimal.

 Additionally, are there other options for handling node failures? Other
 than maxFailures I've only seen things relating to shuffle failures? In the
 one instance I've had a node lose communication, it killed the job. I'd
 assumed the RDD would reconstruct. For now I've tried to work around it by
 persisting to multiple machines (MEMORY_AND_DISK_SER_2).

 Thanks! ^_^

 --
 Regards,
 Stephen Merity
 Data Scientist @ Common Crawl



Re: Mllib kmeans #iteration

2015-04-03 Thread amoners
Have you referred to the official documentation of k-means at
https://spark.apache.org/docs/1.1.1/mllib-clustering.html ?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-kmeans-iteration-tp22353p22365.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark Application Stages and DAG

2015-04-03 Thread Tathagata Das
What he meant is that you should look it up in the Spark UI, specifically in the
Stages tab, to see what is taking so long. And yes, a code snippet helps us debug.
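(A small aside, not something mentioned above: rdd.toDebugString prints the lineage
of an RDD, and the shuffle boundaries in that output correspond to the stage splits
you see in the UI. A tiny illustration from the shell, with a made-up input path:)

val words = sc.textFile("hdfs:///tmp/input.txt").flatMap(_.split(" "))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
println(counts.toDebugString)  // prints the chain of transformations behind each stage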

TD

On Fri, Apr 3, 2015 at 12:47 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 You need to open the page of the Stage which is taking time, and see how long it's
 spending on GC etc. It would also be good to post that Stage and its
 previous transformations' code snippets to help us understand it better.

 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 1:05 PM, Vijay Innamuri vijay.innam...@gmail.com
 wrote:


 When I run the Spark application (streaming) in local mode, I can see
 the execution progress as below:

 [Stage 0: =================>                  (1817 + 1) / 3125]
 [Stage 2: =======>                            (740 + 1) / 3125]

 One of the stages is taking a long time to execute.

 How do I find the transformations/actions associated with a particular
 stage?
 Is there any way to find the execution DAG of a Spark application?

 Regards
 Vijay





Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Akhil Das
How did you build Spark? Which version of Spark are you using? Doesn't
this thread already explain it?
https://www.mail-archive.com/user@spark.apache.org/msg25505.html
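For reference, a Hive-enabled build is usually along these lines (the exact
profiles depend on your Hadoop version, so adjust as needed):

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package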

Thanks
Best Regards

On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 Tried your suggestion to no avail.  I actually do not see any jackson or
 json serde jars in the $HIVE/lib directory.  This is hive 0.13.1 and
 spark 1.2.1

 Here is what I did:

 I have added the lib folder to the --jars option when starting the
 spark-shell,
 but the job fails. The hive-site.xml is in the $SPARK_HOME/conf directory.

 I start the spark-shell as follows:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 
 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

 and like this

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 
 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar 
 --jars /opt/hive/0.13.1/lib/*

 I’m just doing this in the spark-shell now:

 import org.apache.spark.sql.hive._
 val sqlContext = new HiveContext(sc)
 import sqlContext._
 case class MetricTable(path: String, pathElements: String, name: String, value: String)
 val mt = new MetricTable(path = "/DC1/HOST1/",
   pathElements = """[{"node": "DataCenter", "value": "DC1"}, {"node": "host", "value": "HOST1"}]""",
   name = "Memory Usage (%)",
   value = "29.590943279257175")
 val rdd1 = sc.makeRDD(List(mt))
 rdd1.printSchema()
 rdd1.registerTempTable("metric_table")
 sql("""
   SELECT path, name, value, v1.peValue, v1.peName
     FROM metric_table
       lateral view json_tuple(pathElements, 'name', 'value') v1
         as peName, peValue
   """).collect.foreach(println(_))

 It results in the same error:

 15/04/02 12:33:59 INFO ParseDriver: Parsing command: SELECT path, name, 
 value, v1.peValue, v1.peName FROM metric_table   lateral view 
 json_tuple(pathElements, 'name', 'value') v1 as peName, peValue
 15/04/02 12:34:00 INFO ParseDriver: Parse Completed
 res2: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[5] at RDD at SchemaRDD.scala:108
 == Query Plan ==
 == Physical Plan ==
 java.lang.ClassNotFoundException: json_tuple

 Any other suggestions or am I doing something else wrong here?

 -Todd



 On Thu, Apr 2, 2015 at 2:00 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Try adding all the jars in your $HIVE/lib directory. If you want the
 specific jar, you could look for jackson or json serde in it.

 Thanks
 Best Regards

 On Thu, Apr 2, 2015 at 12:49 AM, Todd Nist tsind...@gmail.com wrote:

 I have a feeling I’m missing a Jar that provides the support, or this
 may be related to https://issues.apache.org/jira/browse/SPARK-5792.
 If it is a Jar where would I find that ? I would have thought in the
 $HIVE/lib folder, but not sure which jar contains it.

 Error:

 Create Metric Temporary Table for querying
 15/04/01 14:41:44 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
 15/04/01 14:41:44 INFO ObjectStore: ObjectStore, initialize called
 15/04/01 14:41:45 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
 15/04/01 14:41:45 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
 15/04/01 14:41:45 INFO BlockManager: Removing broadcast 0
 15/04/01 14:41:45 INFO BlockManager: Removing block broadcast_0
 15/04/01 14:41:45 INFO MemoryStore: Block broadcast_0 of size 1272 dropped from memory (free 278018571)
 15/04/01 14:41:45 INFO BlockManager: Removing block broadcast_0_piece0
 15/04/01 14:41:45 INFO MemoryStore: Block broadcast_0_piece0 of size 869 dropped from memory (free 278019440)
 15/04/01 14:41:45 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.1.5:63230 in memory (size: 869.0 B, free: 265.1 MB)
 15/04/01 14:41:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
 15/04/01 14:41:45 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.1.5:63278 in memory (size: 869.0 B, free: 530.0 MB)
 15/04/01 14:41:45 INFO ContextCleaner: Cleaned broadcast 0
 15/04/01 14:41:46 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
 15/04/01 14:41:46 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table.
 15/04/01 14:41:46 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table.
 15/04/01 14:41:47 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table.
 15/04/01 14:41:47 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its

Re: Spark Streaming Worker runs out of inodes

2015-04-03 Thread Akhil Das
Did you try these?

- Disable shuffle : spark.shuffle.spill=false
- Enable log rotation:

sparkConf.set("spark.executor.logs.rolling.strategy", "size")
    .set("spark.executor.logs.rolling.size.maxBytes", "1024")
    .set("spark.executor.logs.rolling.maxRetainedFiles", "3")
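If it is the standalone worker's application directories that keep growing, the
worker-side cleanup settings also have to be passed as -D options through
SPARK_WORKER_OPTS in spark-env.sh, roughly like this (interval/TTL values are
just examples):

export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800 -Dspark.worker.cleanup.appDataTtl=604800"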


Thanks
Best Regards

On Fri, Apr 3, 2015 at 9:09 AM, a mesar amesa...@gmail.com wrote:

 Yes, with spark.cleaner.ttl set there is no cleanup.  We pass 
 --properties-file
 spark-dev.conf to spark-submit where  spark-dev.conf contains:

 spark.master spark://10.250.241.66:7077
 spark.logConf true
 spark.cleaner.ttl 1800
 spark.executor.memory 10709m
 spark.cores.max 4
 spark.shuffle.consolidateFiles true

 On Thu, Apr 2, 2015 at 7:12 PM, Tathagata Das t...@databricks.com wrote:

 Are you saying that even with the spark.cleaner.ttl set your files are
 not getting cleaned up?

 TD

 On Thu, Apr 2, 2015 at 8:23 AM, andrem amesa...@gmail.com wrote:

 Apparently Spark Streaming 1.3.0 is not cleaning up its internal files
 and
 the worker nodes eventually run out of inodes.
 We see tons of old shuffle_*.data and *.index files that are never
 deleted.
 How do we get Spark to remove these files?

 We have a simple standalone app with one RabbitMQ receiver and a two node
 cluster (2 x r3large AWS instances).
 Batch interval is 10 minutes after which we process data and write
 results
 to DB. No windowing or state mgmt is used.

 I've poured over the documentation and tried setting the following
 properties but they have not helped.
 As a work around we're using a cron script that periodically cleans up
 old
 files but this has a bad smell to it.

 SPARK_WORKER_OPTS in spark-env.sh on every worker node
   spark.worker.cleanup.enabled true
   spark.worker.cleanup.interval
   spark.worker.cleanup.appDataTtl

 Also tried on the driver side:
   spark.cleaner.ttl
   spark.shuffle.consolidateFiles true



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Worker-runs-out-of-inodes-tp22355.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.







Re: Spark Streaming Worker runs out of inodes

2015-04-03 Thread Charles Feduke
You could also try setting your `nofile` value in /etc/security/limits.conf
for `soft` to some ridiculously high value if you haven't done so already.
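For example, something like this in limits.conf (both soft and hard, then log in
again for it to take effect):

*    soft    nofile    1000000
*    hard    nofile    1000000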

On Fri, Apr 3, 2015 at 2:09 AM Akhil Das ak...@sigmoidanalytics.com wrote:

 Did you try these?

 - Disable shuffle : spark.shuffle.spill=false
 - Enable log rotation:

 sparkConf.set("spark.executor.logs.rolling.strategy", "size")
     .set("spark.executor.logs.rolling.size.maxBytes", "1024")
     .set("spark.executor.logs.rolling.maxRetainedFiles", "3")


 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 9:09 AM, a mesar amesa...@gmail.com wrote:

 Yes, with spark.cleaner.ttl set there is no cleanup.  We pass 
 --properties-file
 spark-dev.conf to spark-submit where  spark-dev.conf contains:

 spark.master spark://10.250.241.66:7077
 spark.logConf true
 spark.cleaner.ttl 1800
 spark.executor.memory 10709m
 spark.cores.max 4
 spark.shuffle.consolidateFiles true

 On Thu, Apr 2, 2015 at 7:12 PM, Tathagata Das t...@databricks.com
 wrote:

 Are you saying that even with the spark.cleaner.ttl set your files are
 not getting cleaned up?

 TD

 On Thu, Apr 2, 2015 at 8:23 AM, andrem amesa...@gmail.com wrote:

 Apparently Spark Streaming 1.3.0 is not cleaning up its internal files
 and
 the worker nodes eventually run out of inodes.
 We see tons of old shuffle_*.data and *.index files that are never
 deleted.
 How do we get Spark to remove these files?

 We have a simple standalone app with one RabbitMQ receiver and a two
 node
 cluster (2 x r3large AWS instances).
 Batch interval is 10 minutes after which we process data and write
 results
 to DB. No windowing or state mgmt is used.

 I've poured over the documentation and tried setting the following
 properties but they have not helped.
 As a work around we're using a cron script that periodically cleans up
 old
 files but this has a bad smell to it.

 SPARK_WORKER_OPTS in spark-env.sh on every worker node
   spark.worker.cleanup.enabled true
   spark.worker.cleanup.interval
   spark.worker.cleanup.appDataTtl

 Also tried on the driver side:
   spark.cleaner.ttl
   spark.shuffle.consolidateFiles true



 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/Spark-Streaming-Worker-runs-out-of-
 inodes-tp22355.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.








About Waiting batches on the spark streaming UI

2015-04-03 Thread bit1...@163.com

I copied the following from the Spark Streaming UI. I don't know why the
Waiting batches is 1; my understanding is that it should be 72.
Following is my understanding:
1. Total time is 1 minute 35 seconds = 95 seconds.
2. Batch interval is 1 second, so 95 batches are generated in 95 seconds.
3. Processed batches are 23 (correct, because my processing code does
nothing but sleep 4 seconds).
4. So the waiting batches should be 95 - 23 = 72.


Started at: Fri Apr 03 15:17:47 CST 2015 
Time since start: 1 minute 35 seconds 
Network receivers: 1 
Batch interval: 1 second 
Processed batches: 23 
Waiting batches: 1 
Received records: 0 
Processed records: 0   



bit1...@163.com


Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame

2015-04-03 Thread Jaonary Rabarisoa
Good! Thank you.

On Thu, Apr 2, 2015 at 9:05 AM, Xiangrui Meng men...@gmail.com wrote:

 I reproduced the bug on master and submitted a patch for it:
 https://github.com/apache/spark/pull/5329. It may get into Spark
 1.3.1. Thanks for reporting the bug! -Xiangrui

 On Wed, Apr 1, 2015 at 12:57 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:
  Hmm, I got the same error with the master. Here is another test example
 that
  fails. Here, I explicitly create
  a Row RDD which corresponds to the use case I am in :
 
  object TestDataFrame {

    def main(args: Array[String]): Unit = {

      val conf = new SparkConf().setAppName("TestDataFrame").setMaster("local[4]")
      val sc = new SparkContext(conf)
      val sqlContext = new SQLContext(sc)

      import sqlContext.implicits._

      val data = Seq(LabeledPoint(1, Vectors.zeros(10)))
      val dataDF = sc.parallelize(data).toDF

      dataDF.printSchema()
      dataDF.save("test1.parquet") // OK

      val dataRow = data.map { case LabeledPoint(l: Double, f: mllib.linalg.Vector) =>
        Row(l, f)
      }

      val dataRowRDD = sc.parallelize(dataRow)
      val dataDF2 = sqlContext.createDataFrame(dataRowRDD, dataDF.schema)

      dataDF2.printSchema()

      dataDF2.saveAsParquetFile("test3.parquet") // FAIL !!!
    }
  }
 
 
  On Tue, Mar 31, 2015 at 11:18 PM, Xiangrui Meng men...@gmail.com
 wrote:
 
  I cannot reproduce this error on master, but I'm not aware of any
  recent bug fixes that are related. Could you build and try the current
  master? -Xiangrui
 
  On Tue, Mar 31, 2015 at 4:10 AM, Jaonary Rabarisoa jaon...@gmail.com
  wrote:
   Hi all,
  
   A DataFrame with a user defined type (here mllib.Vector) created with
   sqlContext.createDataFrame can't be saved to a parquet file and raises a
   "ClassCastException: org.apache.spark.mllib.linalg.DenseVector cannot be
   cast to org.apache.spark.sql.Row" error.
  
   Here is an example of code to reproduce this error :
  
   object TestDataFrame {

     def main(args: Array[String]): Unit = {
       //System.loadLibrary(Core.NATIVE_LIBRARY_NAME)
       val conf = new SparkConf().setAppName("RankingEval").setMaster("local[8]")
         .set("spark.executor.memory", "6g")

       val sc = new SparkContext(conf)
       val sqlContext = new SQLContext(sc)

       import sqlContext.implicits._

       val data = sc.parallelize(Seq(LabeledPoint(1, Vectors.zeros(10))))
       val dataDF = data.toDF

       dataDF.save("test1.parquet")

       val dataDF2 = sqlContext.createDataFrame(dataDF.rdd, dataDF.schema)

       dataDF2.save("test2.parquet")
     }
   }
  
  
   Is this related to https://issues.apache.org/jira/browse/SPARK-5532
 and
   how
   can it be solved ?
  
  
   Cheers,
  
  
   Jao
 
 



Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Todd Nist
I placed it there.  It was downloaded from MySql site.

On Fri, Apr 3, 2015 at 6:25 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 Akhil
 you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar .
 how come you got this lib into spark/lib folder.
 1) did you place it there ?
 2) What is download location ?


 On Fri, Apr 3, 2015 at 3:42 PM, Todd Nist tsind...@gmail.com wrote:

 Started the spark shell with the one jar from hive suggested:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 
 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar 
 --jars /opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar

 Results in the same error:

 scala sql( | SELECT path, name, value, v1.peValue, v1.peName 
 |  FROM metric_table |lateral view 
 json_tuple(pathElements, 'name', 'value') v1 |  as peName, 
 peValue | )
 15/04/03 06:01:30 INFO ParseDriver: Parsing command: SELECT path, name, 
 value, v1.peValue, v1.peName FROM metric_table   lateral 
 view json_tuple(pathElements, 'name', 'value') v1 as peName, 
 peValue
 15/04/03 06:01:31 INFO ParseDriver: Parse Completed
 res2: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[5] at RDD at SchemaRDD.scala:108== Query Plan  Physical Plan ==
 java.lang.ClassNotFoundException: json_tuple

 I will try the rebuild.  Thanks again for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 5:34 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Can you try building Spark
 https://spark.apache.org/docs/1.2.0/building-spark.html#building-with-hive-and-jdbc-support%23building-with-hive-and-jdbc-support
 with hive support? Before that try to run the following:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-
 cores 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.
 1.34-bin.jar --jars /opt/hive/0.13.1/lib/hive-exec.jar

 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 This is for version 1.2.1.  Well the other thread that you reference
 was me attempting it in 1.3.0 to see if the issue was related to 1.2.1.  I
 did not build Spark but used the version from the Spark download site for
 1.2.1 Pre Built for Hadoop 2.4 or Later.

 Since I get the error in both 1.2.1 and 1.3.0,

 15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in
 thread main java.lang.ClassNotFoundException: json_tuple at
 java.net.URLClassLoader$1.run(

 It looks like I just don't have the jar.  Even including all jars in
 the $HIVE/lib directory did not seem to work.  Though when looking in
 $HIVE/lib for 0.13.1, I do not see any json serde or jackson files.  I do
 see that hive-exec.jar contains
 the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
 you know if there is another Jar that is required or should it work just by
 including all jars from $HIVE/lib?

 I can build it locally, but did not think that was required based on
 the version I downloaded; is that not the case?

 Thanks for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 How did you build spark? which version of spark are you having?
 Doesn't this thread already explains it?
 https://www.mail-archive.com/user@spark.apache.org/msg25505.html

 Thanks
 Best Regards

 On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 Tried your suggestion to no avail.  I actually to not see and
 jackson or json serde jars in the $HIVE/lib directory.  This is hive
 0.13.1 and spark 1.2.1

 Here is what I did:

 I have added the lib folder to the –jars option when starting the
 spark-shell,
 but the job fails. The hive-site.xml is in the $SPARK_HOME/conf
 directory.

 I start the spark-shell as follows:

 ./bin/spark-shell --master spark://radtech.io:7077 
 --total-executor-cores 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

 and like this

 ./bin/spark-shell --master spark://radtech.io:7077 
 --total-executor-cores 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
 /opt/hive/0.13.1/lib/*

 I’m just doing this in the spark-shell now:

 import org.apache.spark.sql.hive._val sqlContext = new 
 HiveContext(sc)import sqlContext._case class MetricTable(path: String, 
 pathElements: String, name: String, value: String)val mt = new 
 MetricTable(path: /DC1/HOST1/,
 pathElements: [{node: DataCenter,value: DC1},{node: 
 host,value: HOST1}],
 name: Memory Usage (%),
 value: 29.590943279257175)val rdd1 = sc.makeRDD(List(mt))
 rdd1.printSchema()
 rdd1.registerTempTable(metric_table)
 sql(
 SELECT path, name, value, v1.peValue, v1.peName
  FROM metric_table
lateral view json_tuple(pathElements, 'name', 'value') v1
  as peName, peValue
 )
 .collect.foreach(println(_))

 It results in the same 

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Todd Nist
Hi Deepujain,

I did include the jar file, I believe it is hive-exe.jar, through the
--jars option:

./bin/spark-shell --master spark://radtech.io:7077
--total-executor-cores 2 --driver-class-path
/usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars
/opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar

Results in the same error.  I'm going to do the rebuild in a few minutes.

Thanks for the assistance.

-Todd



On Fri, Apr 3, 2015 at 6:30 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 I think you need to include the jar file through --jars option that
 contains the hive definition (code) of UDF json_tuple. That should solve
 your problem.

 On Fri, Apr 3, 2015 at 3:57 PM, Todd Nist tsind...@gmail.com wrote:

 I placed it there.  It was downloaded from MySql site.

 On Fri, Apr 3, 2015 at 6:25 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 Akhil
 you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar
 . how come you got this lib into spark/lib folder.
 1) did you place it there ?
 2) What is download location ?


 On Fri, Apr 3, 2015 at 3:42 PM, Todd Nist tsind...@gmail.com wrote:

 Started the spark shell with the one jar from hive suggested:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 
 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
 /opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar

 Results in the same error:

 scala sql( | SELECT path, name, value, v1.peValue, v1.peName   
   |  FROM metric_table |lateral view 
 json_tuple(pathElements, 'name', 'value') v1 |  as peName, 
 peValue | )
 15/04/03 06:01:30 INFO ParseDriver: Parsing command: SELECT path, name, 
 value, v1.peValue, v1.peName FROM metric_table   lateral 
 view json_tuple(pathElements, 'name', 'value') v1 as peName, 
 peValue
 15/04/03 06:01:31 INFO ParseDriver: Parse Completed
 res2: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[5] at RDD at SchemaRDD.scala:108== Query Plan  Physical Plan 
 ==
 java.lang.ClassNotFoundException: json_tuple

 I will try the rebuild.  Thanks again for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 5:34 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Can you try building Spark
 https://spark.apache.org/docs/1.2.0/building-spark.html#building-with-hive-and-jdbc-support%23building-with-hive-and-jdbc-support
 with hive support? Before that try to run the following:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-
 cores 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-
 5.1.34-bin.jar --jars /opt/hive/0.13.1/lib/hive-exec.jar

 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 This is for version 1.2.1.  Well the other thread that you reference
 was me attempting it in 1.3.0 to see if the issue was related to 1.2.1.  
 I
 did not build Spark but used the version from the Spark download site for
 1.2.1 Pre Built for Hadoop 2.4 or Later.

 Since I get the error in both 1.2.1 and 1.3.0,

 15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in
 thread main java.lang.ClassNotFoundException: json_tuple at
 java.net.URLClassLoader$1.run(

 It looks like I just don't have the jar.  Even including all jars in
 the $HIVE/lib directory did not seem to work.  Though when looking in
 $HIVE/lib for 0.13.1, I do not see any json serde or jackson files.  I do
 see that hive-exec.jar contains
 the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
 you know if there is another Jar that is required or should it work just 
 by
 including all jars from $HIVE/lib?

 I can build it locally, but did not think that was required based on
 the version I downloaded; is that not the case?

 Thanks for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das ak...@sigmoidanalytics.com
  wrote:

 How did you build spark? which version of spark are you having?
 Doesn't this thread already explains it?
 https://www.mail-archive.com/user@spark.apache.org/msg25505.html

 Thanks
 Best Regards

 On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist tsind...@gmail.com
 wrote:

 Hi Akhil,

 Tried your suggestion to no avail.  I actually to not see and
 jackson or json serde jars in the $HIVE/lib directory.  This is 
 hive
 0.13.1 and spark 1.2.1

 Here is what I did:

 I have added the lib folder to the –jars option when starting the
 spark-shell,
 but the job fails. The hive-site.xml is in the $SPARK_HOME/conf
 directory.

 I start the spark-shell as follows:

 ./bin/spark-shell --master spark://radtech.io:7077 
 --total-executor-cores 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

 and like this

 ./bin/spark-shell --master spark://radtech.io:7077 
 --total-executor-cores 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
 /opt/hive/0.13.1/lib/*

 I’m just doing 

Re: 答复:maven compile error

2015-04-03 Thread Ted Yu
Can you include -X in your maven command and pastebin the output?
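For example, redirecting to a file you can then upload:

mvn -X -DskipTests clean package > build.log 2>&1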

Cheers



 On Apr 3, 2015, at 3:58 AM, myelinji myeli...@aliyun.com wrote:
 
 Thank you for your reply. When I'm using maven to compile the whole project,
 the errors are as follows:
 
 [INFO] Spark Project Parent POM .. SUCCESS [4.136s]
 [INFO] Spark Project Networking .. SUCCESS [7.405s]
 [INFO] Spark Project Shuffle Streaming Service ... SUCCESS [5.071s]
 [INFO] Spark Project Core  SUCCESS [3:08.445s]
 [INFO] Spark Project Bagel ... SUCCESS [21.613s]
 [INFO] Spark Project GraphX .. SUCCESS [58.915s]
 [INFO] Spark Project Streaming ... SUCCESS [1:26.202s]
 [INFO] Spark Project Catalyst  FAILURE [1.537s]
 [INFO] Spark Project SQL . SKIPPED
 [INFO] Spark Project ML Library .. SKIPPED
 [INFO] Spark Project Tools ... SKIPPED
 [INFO] Spark Project Hive  SKIPPED
 [INFO] Spark Project REPL  SKIPPED
 [INFO] Spark Project Assembly  SKIPPED
 [INFO] Spark Project External Twitter  SKIPPED
 [INFO] Spark Project External Flume Sink . SKIPPED
 [INFO] Spark Project External Flume .. SKIPPED
 [INFO] Spark Project External MQTT ... SKIPPED
 [INFO] Spark Project External ZeroMQ . SKIPPED
 [INFO] Spark Project External Kafka .. SKIPPED
 [INFO] Spark Project Examples  SKIPPED
 
 It seems like there is something wrong with the catalyst project. Why can't I
 compile this project?
 
 
 --
  From: Sean Owen so...@cloudera.com
  Sent: Friday, April 3, 2015, 17:48
  To: myelinji myeli...@aliyun.com
  Cc: Spark user list user@spark.apache.org
  Subject: Re: maven compile error
 
 If you're asking about a compile error, you should include the command
 you used to compile.
 
 I am able to compile branch 1.2 successfully with mvn -DskipTests
 clean package.
 
 This error is actually an error from scalac, not a compile error from
 the code. It sort of sounds like it has not been able to download
 scala dependencies. Check or maybe recreate your environment.
 
 On Fri, Apr 3, 2015 at 3:19 AM, myelinji myeli...@aliyun.com wrote:
  Hi all:
  Just now I checked out spark-1.2 on GitHub and wanted to build it using Maven;
  however, I encountered an error during compilation:
 
  [INFO]
  
  [ERROR] Failed to execute goal
  net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on
  project spark-catalyst_2.10: wrap:
  scala.reflect.internal.MissingRequirementError: object scala.runtime in
  compiler mirror not found. - [Help 1]
  org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
  goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile
  (scala-compile-first) on project spark-catalyst_2.10: wrap:
  scala.reflect.internal.MissingRequirementError: object scala.runtime in
  compiler mirror not found.
  at
  org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
  at
  org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
  at
  org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
  at
  org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
  at
  org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
  at
  org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
  at
  org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
  at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
  at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
  at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
  at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
  at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
  sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at
  sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at
  org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
  at
  org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
  at
  org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
  at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
  Caused by: 

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread ๏̯͡๏
I think you need to include, through the --jars option, the jar file that
contains the hive definition (code) of the UDF json_tuple. That should solve
your problem.

On Fri, Apr 3, 2015 at 3:57 PM, Todd Nist tsind...@gmail.com wrote:

 I placed it there.  It was downloaded from MySql site.

 On Fri, Apr 3, 2015 at 6:25 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 Akhil
 you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar .
 how come you got this lib into spark/lib folder.
 1) did you place it there ?
 2) What is download location ?


 On Fri, Apr 3, 2015 at 3:42 PM, Todd Nist tsind...@gmail.com wrote:

 Started the spark shell with the one jar from hive suggested:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 
 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
 /opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar

 Results in the same error:

 scala sql( | SELECT path, name, value, v1.peValue, v1.peName
  |  FROM metric_table |lateral view 
 json_tuple(pathElements, 'name', 'value') v1 |  as peName, 
 peValue | )
 15/04/03 06:01:30 INFO ParseDriver: Parsing command: SELECT path, name, 
 value, v1.peValue, v1.peName FROM metric_table   lateral 
 view json_tuple(pathElements, 'name', 'value') v1 as peName, 
 peValue
 15/04/03 06:01:31 INFO ParseDriver: Parse Completed
 res2: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[5] at RDD at SchemaRDD.scala:108== Query Plan  Physical Plan 
 ==
 java.lang.ClassNotFoundException: json_tuple

 I will try the rebuild.  Thanks again for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 5:34 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Can you try building Spark
 https://spark.apache.org/docs/1.2.0/building-spark.html#building-with-hive-and-jdbc-support%23building-with-hive-and-jdbc-support
 with hive support? Before that try to run the following:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-
 cores 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5
 .1.34-bin.jar --jars /opt/hive/0.13.1/lib/hive-exec.jar

 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 This is for version 1.2.1.  Well the other thread that you reference
 was me attempting it in 1.3.0 to see if the issue was related to 1.2.1.  I
 did not build Spark but used the version from the Spark download site for
 1.2.1 Pre Built for Hadoop 2.4 or Later.

 Since I get the error in both 1.2.1 and 1.3.0,

 15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in
 thread main java.lang.ClassNotFoundException: json_tuple at
 java.net.URLClassLoader$1.run(

 It looks like I just don't have the jar.  Even including all jars in
 the $HIVE/lib directory did not seem to work.  Though when looking in
 $HIVE/lib for 0.13.1, I do not see any json serde or jackson files.  I do
 see that hive-exec.jar contains
 the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
 you know if there is another Jar that is required or should it work just 
 by
 including all jars from $HIVE/lib?

 I can build it locally, but did not think that was required based on
 the version I downloaded; is that not the case?

 Thanks for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 How did you build spark? which version of spark are you having?
 Doesn't this thread already explains it?
 https://www.mail-archive.com/user@spark.apache.org/msg25505.html

 Thanks
 Best Regards

 On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist tsind...@gmail.com
 wrote:

 Hi Akhil,

 Tried your suggestion to no avail.  I actually to not see and
 jackson or json serde jars in the $HIVE/lib directory.  This is hive
 0.13.1 and spark 1.2.1

 Here is what I did:

 I have added the lib folder to the –jars option when starting the
 spark-shell,
 but the job fails. The hive-site.xml is in the $SPARK_HOME/conf
 directory.

 I start the spark-shell as follows:

 ./bin/spark-shell --master spark://radtech.io:7077 
 --total-executor-cores 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

 and like this

 ./bin/spark-shell --master spark://radtech.io:7077 
 --total-executor-cores 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
 /opt/hive/0.13.1/lib/*

 I’m just doing this in the spark-shell now:

 import org.apache.spark.sql.hive._val sqlContext = new 
 HiveContext(sc)import sqlContext._case class MetricTable(path: String, 
 pathElements: String, name: String, value: String)val mt = new 
 MetricTable(path: /DC1/HOST1/,
 pathElements: [{node: DataCenter,value: DC1},{node: 
 host,value: HOST1}],
 name: Memory Usage (%),
 value: 29.590943279257175)val rdd1 = sc.makeRDD(List(mt))
 rdd1.printSchema()
 rdd1.registerTempTable(metric_table)
 sql(
 SELECT path, 

Which OS for Spark cluster nodes?

2015-04-03 Thread Horsmann, Tobias
Hi,
Are there any recommendations for operating systems that one should use for
setting up Spark/Hadoop nodes in general?
I am not familiar with the differences between the various Linux distributions
or how well they are (not) suited for cluster set-ups, so I wondered if there
are some preferred choices?

Regards,



Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Todd Nist
Started the spark shell with the one jar from hive suggested:

./bin/spark-shell --master spark://radtech.io:7077
--total-executor-cores 2 --driver-class-path
/usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars
/opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar

Results in the same error:

scala> sql("""
     | SELECT path, name, value, v1.peValue, v1.peName
     |   FROM metric_table
     |     lateral view json_tuple(pathElements, 'name', 'value') v1
     |       as peName, peValue
     | """)
15/04/03 06:01:30 INFO ParseDriver: Parsing command: SELECT path, name, value, v1.peValue, v1.peName FROM metric_table   lateral view json_tuple(pathElements, 'name', 'value') v1 as peName, peValue
15/04/03 06:01:31 INFO ParseDriver: Parse Completed
res2: org.apache.spark.sql.SchemaRDD =
SchemaRDD[5] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
java.lang.ClassNotFoundException: json_tuple

I will try the rebuild.  Thanks again for the assistance.

-Todd


On Fri, Apr 3, 2015 at 5:34 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Can you try building Spark
 https://spark.apache.org/docs/1.2.0/building-spark.html#building-with-hive-and-jdbc-support%23building-with-hive-and-jdbc-support
 with hive support? Before that try to run the following:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores
 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin
 .jar --jars /opt/hive/0.13.1/lib/hive-exec.jar

 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 This is for version 1.2.1.  Well the other thread that you reference was
 me attempting it in 1.3.0 to see if the issue was related to 1.2.1.  I did
 not build Spark but used the version from the Spark download site for 1.2.1
 Pre Built for Hadoop 2.4 or Later.

 Since I get the error in both 1.2.1 and 1.3.0,

 15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in thread
 main java.lang.ClassNotFoundException: json_tuple at
 java.net.URLClassLoader$1.run(

 It looks like I just don't have the jar.  Even including all jars in the
 $HIVE/lib directory did not seem to work.  Though when looking in $HIVE/lib
 for 0.13.1, I do not see any json serde or jackson files.  I do see that
 hive-exec.jar contains
 the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
 you know if there is another Jar that is required or should it work just by
 including all jars from $HIVE/lib?

 I can build it locally, but did not think that was required based on the
 version I downloaded; is that not the case?

 Thanks for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 How did you build spark? which version of spark are you having? Doesn't
 this thread already explains it?
 https://www.mail-archive.com/user@spark.apache.org/msg25505.html

 Thanks
 Best Regards

 On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 Tried your suggestion to no avail.  I actually to not see and jackson
 or json serde jars in the $HIVE/lib directory.  This is hive 0.13.1 and
 spark 1.2.1

 Here is what I did:

 I have added the lib folder to the –jars option when starting the
 spark-shell,
 but the job fails. The hive-site.xml is in the $SPARK_HOME/conf
 directory.

 I start the spark-shell as follows:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 
 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

 and like this

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 
 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
 /opt/hive/0.13.1/lib/*

 I’m just doing this in the spark-shell now:

 import org.apache.spark.sql.hive._val sqlContext = new 
 HiveContext(sc)import sqlContext._case class MetricTable(path: String, 
 pathElements: String, name: String, value: String)val mt = new 
 MetricTable(path: /DC1/HOST1/,
 pathElements: [{node: DataCenter,value: DC1},{node: 
 host,value: HOST1}],
 name: Memory Usage (%),
 value: 29.590943279257175)val rdd1 = sc.makeRDD(List(mt))
 rdd1.printSchema()
 rdd1.registerTempTable(metric_table)
 sql(
 SELECT path, name, value, v1.peValue, v1.peName
  FROM metric_table
lateral view json_tuple(pathElements, 'name', 'value') v1
  as peName, peValue
 )
 .collect.foreach(println(_))

 It results in the same error:

 15/04/02 12:33:59 INFO ParseDriver: Parsing command: SELECT path, name, 
 value, v1.peValue, v1.peName FROM metric_table   lateral 
 view json_tuple(pathElements, 'name', 'value') v1 as peName, 
 peValue
 15/04/02 12:34:00 INFO ParseDriver: Parse Completed
 res2: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[5] at RDD at SchemaRDD.scala:108== Query Plan  Physical Plan 
 ==
 java.lang.ClassNotFoundException: json_tuple

 

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Akhil Das
Copy pasted his command in the same thread.

Thanks
Best Regards

On Fri, Apr 3, 2015 at 3:55 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 Akhil
 you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar .
 how come you got this lib into spark/lib folder.
 1) did you place it there ?
 2) What is download location ?


 On Fri, Apr 3, 2015 at 3:42 PM, Todd Nist tsind...@gmail.com wrote:

 Started the spark shell with the one jar from hive suggested:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 
 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar 
 --jars /opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar

 Results in the same error:

 scala sql( | SELECT path, name, value, v1.peValue, v1.peName 
 |  FROM metric_table |lateral view 
 json_tuple(pathElements, 'name', 'value') v1 |  as peName, 
 peValue | )
 15/04/03 06:01:30 INFO ParseDriver: Parsing command: SELECT path, name, 
 value, v1.peValue, v1.peName FROM metric_table   lateral 
 view json_tuple(pathElements, 'name', 'value') v1 as peName, 
 peValue
 15/04/03 06:01:31 INFO ParseDriver: Parse Completed
 res2: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[5] at RDD at SchemaRDD.scala:108== Query Plan  Physical Plan ==
 java.lang.ClassNotFoundException: json_tuple

 I will try the rebuild.  Thanks again for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 5:34 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Can you try building Spark
 https://spark.apache.org/docs/1.2.0/building-spark.html#building-with-hive-and-jdbc-support%23building-with-hive-and-jdbc-support
 with hive support? Before that try to run the following:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-
 cores 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.
 1.34-bin.jar --jars /opt/hive/0.13.1/lib/hive-exec.jar

 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 This is for version 1.2.1.  Well the other thread that you reference
 was me attempting it in 1.3.0 to see if the issue was related to 1.2.1.  I
 did not build Spark but used the version from the Spark download site for
 1.2.1 Pre Built for Hadoop 2.4 or Later.

 Since I get the error in both 1.2.1 and 1.3.0,

 15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in
 thread main java.lang.ClassNotFoundException: json_tuple at
 java.net.URLClassLoader$1.run(

 It looks like I just don't have the jar.  Even including all jars in
 the $HIVE/lib directory did not seem to work.  Though when looking in
 $HIVE/lib for 0.13.1, I do not see any json serde or jackson files.  I do
 see that hive-exec.jar contains
 the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
 you know if there is another Jar that is required or should it work just by
 including all jars from $HIVE/lib?

 I can build it locally, but did not think that was required based on
 the version I downloaded; is that not the case?

 Thanks for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 How did you build spark? which version of spark are you having?
 Doesn't this thread already explains it?
 https://www.mail-archive.com/user@spark.apache.org/msg25505.html

 Thanks
 Best Regards

 On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 Tried your suggestion to no avail.  I actually to not see and
 jackson or json serde jars in the $HIVE/lib directory.  This is hive
 0.13.1 and spark 1.2.1

 Here is what I did:

 I have added the lib folder to the –jars option when starting the
 spark-shell,
 but the job fails. The hive-site.xml is in the $SPARK_HOME/conf
 directory.

 I start the spark-shell as follows:

 ./bin/spark-shell --master spark://radtech.io:7077 
 --total-executor-cores 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

 and like this

 ./bin/spark-shell --master spark://radtech.io:7077 
 --total-executor-cores 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
 /opt/hive/0.13.1/lib/*

 I’m just doing this in the spark-shell now:

 import org.apache.spark.sql.hive._val sqlContext = new 
 HiveContext(sc)import sqlContext._case class MetricTable(path: String, 
 pathElements: String, name: String, value: String)val mt = new 
 MetricTable(path: /DC1/HOST1/,
 pathElements: [{node: DataCenter,value: DC1},{node: 
 host,value: HOST1}],
 name: Memory Usage (%),
 value: 29.590943279257175)val rdd1 = sc.makeRDD(List(mt))
 rdd1.printSchema()
 rdd1.registerTempTable(metric_table)
 sql(
 SELECT path, name, value, v1.peValue, v1.peName
  FROM metric_table
lateral view json_tuple(pathElements, 'name', 'value') v1
  as peName, peValue
 )
 .collect.foreach(println(_))

 It results in the 

Re: Which OS for Spark cluster nodes?

2015-04-03 Thread Akhil Das
There isn't any specific Linux distro, but I would prefer Ubuntu for a
beginner as it's very easy to apt-get install things on it.

Thanks
Best Regards

On Fri, Apr 3, 2015 at 4:58 PM, Horsmann, Tobias tobias.horsm...@uni-due.de
 wrote:

  Hi,
 Are there any recommendations for operating systems that one should use
 for setting up Spark/Hadoop nodes in general?
 I am not familiar with the differences between the various linux
 distributions or how well they are (not) suited for cluster set-ups, so I
 wondered if there is some preferred choices?

  Regards,




Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Akhil Das
This thread might give you some insights
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3CCA+WVT8WXbEHac=N0GWxj-s9gqOkgG0VRL5B=ovjwexqm8ev...@mail.gmail.com%3E
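One common way of dealing with non-serializable Avro records (a sketch, and only
one possible direction, assuming Kryo is acceptable for your job) is to switch the
serializer to Kryo and register the generated class:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))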

Thanks
Best Regards

On Fri, Apr 3, 2015 at 3:53 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 My Spark Job failed with


 15/04/03 03:15:36 INFO scheduler.DAGScheduler: Job 0 failed:
 saveAsNewAPIHadoopFile at AbstractInputHelper.scala:103, took 2.480175 s
 15/04/03 03:15:36 ERROR yarn.ApplicationMaster: User class threw
 exception: Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 0)
 had a not serializable result:
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
 - object not serializable (class:
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value:
 {userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt:
 null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
 - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
 - object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0,
 spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt:
 null, currPsLvlId: null}))
 org.apache.spark.SparkException: Job aborted due to stage failure: Task
 0.0 in stage 2.0 (TID 0) had a not serializable result:
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
 - object not serializable (class:
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value:
 {userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt:
 null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
 - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
 - object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0,
 spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt:
 null, currPsLvlId: null}))
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 15/04/03 03:15:36 INFO yarn.ApplicationMaster: Final app status: FAILED,
 exitCode: 15, (reason: User class threw exception: Job aborted due to stage
 failure: Task 0.0 in stage 2.0 (TID 0) had a not serializable result:
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:

 

 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum is auto
 generated from an avro schema using the avro generate-sources maven plugin.


 package com.ebay.ep.poc.spark.reporting.process.model.dw;

 @SuppressWarnings("all")

 @org.apache.avro.specific.AvroGenerated

 public class SpsLevelMetricSum extends
 org.apache.avro.specific.SpecificRecordBase implements
 org.apache.avro.specific.SpecificRecord {
 ...
 ...
 }

 Can anyone suggest how to fix this ?



 --
 Deepak




Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
What version of Cassandra are you using?  Are you using DSE or the stock
Apache Cassandra version?  I have connected it with DSE, but have not
attempted it with the standard Apache Cassandra version.

FWIW,
http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
 provides an ODBC driver for accessing C* from Tableau.  Granted it does not
provide all the goodness of Spark.  Are you attempting to leverage the
spark-cassandra-connector for this?



On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller moham...@glassbeam.com
wrote:

  Hi –



 Is anybody using Tableau to analyze data in Cassandra through the Spark
 SQL Thrift Server?



 Thanks!



 Mohammed





Re: How to get a top X percent of a distribution represented as RDD

2015-04-03 Thread Aung Htet
Hi Debasish, Charles,

I solved the problem by using a BPQ like method, based on your suggestions.
So thanks very much for that!

My approach was
1) Count the population of each segment in the RDD by map/reduce so that I
get the bound number N equivalent to 10% of each segment. This becomes the
size of the BPQ.
2) Associate the bounds N to the corresponding records in the first RDD.
3) Reduce the RDD from step 2 by merging the values in every two rows,
basically creating a sorted list (Indexed Seq)
4) If the size of the sorted list is greater than N (the bound) then,
create a new sorted list by using a priority queue and dequeuing top N
values.

In the end, I get a record for each segment with the N max values for that
segment (a rough sketch of this approach is below).
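
(A rough Scala sketch of the approach, with hypothetical names -- it assumes
an RDD[(String, Double)] of (segment, score) pairs and keeps roughly the top
10% of scores per segment; it is not the exact code used in this thread.)

import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._   // pair-RDD functions on Spark < 1.3

def topFractionPerSegment(scores: RDD[(String, Double)],
                          fraction: Double = 0.10): RDD[(String, List[Double])] = {
  // 1) population of each segment -> bound N (the BPQ size)
  val bounds = scores.mapValues(_ => 1L).reduceByKey(_ + _)
    .mapValues(count => math.max(1, (count * fraction).toInt))

  // 2)-4) associate N with each record and reduce per segment, never
  // keeping more than N scores in any partial result
  scores.join(bounds)                                        // (segment, (score, n))
    .map { case (seg, (score, n)) => (seg, (n, List(score))) }
    .reduceByKey { case ((n, xs), (_, ys)) =>
      val pq = mutable.PriorityQueue[Double]((xs ++ ys): _*) // max-heap
      (n, List.fill(math.min(n, pq.size))(pq.dequeue()))     // top N, descending
    }
    .mapValues(_._2)
}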

Regards,
Aung








On Fri, Mar 27, 2015 at 4:27 PM, Debasish Das debasish.da...@gmail.com
wrote:

 In that case you can directly use count-min-sketch from algebirdthey
 work fine with Spark aggregateBy but I have found the java BPQ from Spark
 much faster than say algebird Heap datastructure...

 On Thu, Mar 26, 2015 at 10:04 PM, Charles Hayden 
 charles.hay...@atigeo.com wrote:

  ​You could also consider using a count-min data structure such as in
 https://github.com/laserson/dsq​

 to get approximate quantiles, then use whatever values you want to filter
 the original sequence.
  --
 *From:* Debasish Das debasish.da...@gmail.com
 *Sent:* Thursday, March 26, 2015 9:45 PM
 *To:* Aung Htet
 *Cc:* user
 *Subject:* Re: How to get a top X percent of a distribution represented
 as RDD

  Idea is to use a heap and get topK elements from every partition...then
 use aggregateBy and for combOp do a merge routine from
 mergeSort...basically get 100 items from partition 1, 100 items from
 partition 2, merge them so that you get sorted 200 items and take 100...for
 merge you can use heap as well...Matei had a BPQ inside Spark which we use
 all the time...Passing arrays over wire is better than passing full heap
 objects and merge routine on array should run faster but needs experiment...
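
 (A minimal sketch of that idea, with hypothetical names -- it assumes a plain
 RDD[Double] of scores, builds a bounded heap per partition, ships sorted
 arrays rather than heap objects, and merges them mergesort-style. Spark's own
 BoundedPriorityQueue is private, so a scala PriorityQueue stands in for it.)

 import scala.collection.mutable
 import org.apache.spark.rdd.RDD

 def topK(scores: RDD[Double], k: Int): Array[Double] = {
   val perPartition = scores.mapPartitions { iter =>
     // min-heap of size <= k: the smallest of the current top-k sits on top
     val heap = new mutable.PriorityQueue[Double]()(Ordering[Double].reverse)
     iter.foreach { x =>
       heap.enqueue(x)
       if (heap.size > k) heap.dequeue()
     }
     // pass a plain descending array over the wire instead of the heap object
     Iterator.single(heap.toArray.sorted(Ordering[Double].reverse))
   }

   // combOp: the merge step of mergesort over two descending arrays, keeping k
   def merge(a: Array[Double], b: Array[Double]): Array[Double] = {
     val out = new Array[Double](math.min(k, a.length + b.length))
     var i = 0; var j = 0; var m = 0
     while (m < out.length) {
       if (j >= b.length || (i < a.length && a(i) >= b(j))) { out(m) = a(i); i += 1 }
       else { out(m) = b(j); j += 1 }
       m += 1
     }
     out
   }

   perPartition.reduce(merge)
 }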

 On Thu, Mar 26, 2015 at 9:26 PM, Aung Htet aung@gmail.com wrote:

 Hi Debasish,

 Thanks for your suggestions. In-memory version is quite useful. I do not
 quite understand how you can use aggregateBy to get 10% top K elements. Can
 you please give an example?

 Thanks,
 Aung

 On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 You can do it in-memory as wellget 10% topK elements from each
 partition and use merge from any sort algorithm like timsortbasically
 aggregateBy

  Your version uses shuffle but this version is 0 shuffle..assuming
 your data set is cached you will be using in-memory allReduce through
 treeAggregate...

  But this is only good for top 10% or bottom 10%...if you need to do
 it for top 30% then may be the shuffle version will work better...

 On Thu, Mar 26, 2015 at 8:31 PM, Aung Htet aung@gmail.com wrote:

 Hi all,

  I have a distribution represented as an RDD of tuples, in rows of
 (segment, score)
 For each segment, I want to discard tuples with top X percent scores.
 This seems hard to do in Spark RDD.

  A naive algorithm would be -

 1) Sort RDD by segment & score (descending)
 2) Within each segment, number the rows from top to bottom.
 3) For each  segment, calculate the cut off index. i.e. 90 for 10% cut
 off out of a segment with 100 rows.
 4) For the entire RDD, filter rows with row num <= cut off index

 This does not look like a good algorithm. I would really appreciate if
 someone can suggest a better way to implement this in Spark.

  Regards,
 Aung








Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Dean Wampler
A hack workaround is to use flatMap:

rdd.flatMap { case (date, array) => for (x <- array) yield (date, x) }

For those of you who don't know Scala, the for comprehension iterates
through the ArrayBuffer, named array, and yields new tuples with the date
and each element. The case expression to the left of the => pattern matches
on the input tuples.
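
For a concrete (hypothetical) run in the spark-shell, after pulling the rows
down to (date, sequence) pairs:

val rdd = sc.parallelize(Seq(("2015-04", Seq("A", "B", "C", "D"))))

rdd.flatMap { case (date, array) => for (x <- array) yield (date, x) }
   .collect()
   .foreach(println)
// (2015-04,A)
// (2015-04,B)
// (2015-04,C)
// (2015-04,D)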

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Thu, Apr 2, 2015 at 10:45 PM, Denny Lee denny.g@gmail.com wrote:

 Thanks Michael - that was it!  I was drawing a blank on this one for some
 reason - much appreciated!


 On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust mich...@databricks.com
 wrote:

 A lateral view explode using HiveQL.  I'm hoping to add explode
 shorthand directly to the df API in 1.4.
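
 (For reference, a sketch of that HiveQL issued through a HiveContext -- the
 table and column names here are hypothetical:)

 // assumes a registered table "months" with columns
 // month: String and items: Array[String]
 val exploded = sqlContext.sql(
   """SELECT month, item
     |FROM months
     |LATERAL VIEW explode(items) itemTable AS item""".stripMargin)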

 On Thu, Apr 2, 2015 at 7:10 PM, Denny Lee denny.g@gmail.com wrote:

 Quick question - the output of a dataframe is in the format of:

 [2015-04, ArrayBuffer(A, B, C, D)]

 and I'd like to return it as:

 2015-04, A
 2015-04, B
 2015-04, C
 2015-04, D

 What's the best way to do this?

 Thanks in advance!






Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Deepak Jain
I was able to write a record that extends SpecificRecord (Avro); that class was
not auto-generated. Do we need to do something extra for auto-generated classes?

Sent from my iPhone

 On 03-Apr-2015, at 5:06 pm, Akhil Das ak...@sigmoidanalytics.com wrote:
 
 This thread might give you some insights 
 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3CCA+WVT8WXbEHac=N0GWxj-s9gqOkgG0VRL5B=ovjwexqm8ev...@mail.gmail.com%3E
 
 Thanks
 Best Regards
 
 On Fri, Apr 3, 2015 at 3:53 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
 My Spark Job failed with
 
 
 15/04/03 03:15:36 INFO scheduler.DAGScheduler: Job 0 failed: 
 saveAsNewAPIHadoopFile at AbstractInputHelper.scala:103, took 2.480175 s
 15/04/03 03:15:36 ERROR yarn.ApplicationMaster: User class threw exception: 
 Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 0) had a not 
 serializable result: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
  - object not serializable (class: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value: 
 {userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt: 
 null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
  - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
  - object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0, 
 spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt: 
 null, currPsLvlId: null}))
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 
 in stage 2.0 (TID 0) had a not serializable result: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
  - object not serializable (class: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value: 
 {userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt: 
 null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
  - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
  - object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0, 
 spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt: 
 null, currPsLvlId: null}))
  at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
  at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
  at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
  at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
  at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
  at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
  at scala.Option.foreach(Option.scala:236)
  at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
  at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
  at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 15/04/03 03:15:36 INFO yarn.ApplicationMaster: Final app status: FAILED, 
 exitCode: 15, (reason: User class threw exception: Job aborted due to stage 
 failure: Task 0.0 in stage 2.0 (TID 0) had a not serializable result: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
 
 
 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum is auto 
 generated through avro schema using avro-generate-sources maven plugin.
 
 
 package com.ebay.ep.poc.spark.reporting.process.model.dw;  
 
 @SuppressWarnings("all")
 
 @org.apache.avro.specific.AvroGenerated
 
 public class SpsLevelMetricSum extends 
 org.apache.avro.specific.SpecificRecordBase implements 
 org.apache.avro.specific.SpecificRecord {
 
 ...
 ...
 }
 
 Can anyone suggest how to fix this ?
 
 
 
 -- 
 Deepak
 


Re: Spark 1.3 UDF ClassNotFoundException

2015-04-03 Thread Markus Ganter
My apologies. I was running this locally and the JAR I was building
using IntelliJ had some issues.
This was not related to UDFs. All works fine now.

On Thu, Apr 2, 2015 at 2:58 PM, Ted Yu yuzhih...@gmail.com wrote:
 Can you show more code in CreateMasterData ?

 How do you run your code ?

 Thanks

 On Thu, Apr 2, 2015 at 11:06 AM, ganterm gant...@gmail.com wrote:

 Hello,

 I started to use the dataframe API in Spark 1.3 with Scala.
 I am trying to implement a UDF and am following the sample here:

 https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.sql.UserDefinedFunction
 meaning
 val predict = udf((score: Double) => if (score > 0.5) true else false)
 df.select( predict(df("score")) )
 All compiles just fine but when I run it, I get a ClassNotFoundException
 (see more details below)
 I am sure that I load the data correctly and that I have a field called
 score with the correct data type.
 Do I need to do anything else like registering the function?

 Thanks!
 Markus

 Exception in thread main org.apache.spark.SparkException: Job aborted
 due
 to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure:
 Lost task 0.3 in stage 6.0 (TID 11, BillSmithPC):
 java.lang.ClassNotFoundException: test.CreateMasterData$$anonfun$1
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:270)
 at

 org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:65)
 ...




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-UDF-ClassNotFoundException-tp22361.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Parquet timestamp support for Hive?

2015-04-03 Thread Rex Xiong
Hi,

I got this error when creating a hive table from parquet file:
DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.UnsupportedOperationException: Parquet does not support
timestamp. See HIVE-6384

I check HIVE-6384, it's fixed in 0.14.
The hive in spark build is a customized version 0.13.1a
(GroupId: org.spark-project.hive), is it possible to get the source code
for it and apply patch from HIVE-6384?

Thanks


Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Deepak Jain
I meant that I did not have to use Kryo. Why will Kryo help fix this issue now?

Sent from my iPhone

 On 03-Apr-2015, at 5:36 pm, Deepak Jain deepuj...@gmail.com wrote:
 
 I was able to write record that extends specificrecord (avro) this class was 
 not auto generated. Do we need to do something extra for auto generated 
 classes 
 
 Sent from my iPhone
 
 On 03-Apr-2015, at 5:06 pm, Akhil Das ak...@sigmoidanalytics.com wrote:
 
 This thread might give you some insights 
 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3CCA+WVT8WXbEHac=N0GWxj-s9gqOkgG0VRL5B=ovjwexqm8ev...@mail.gmail.com%3E
 
 Thanks
 Best Regards
 
 On Fri, Apr 3, 2015 at 3:53 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
 My Spark Job failed with
 
 
 15/04/03 03:15:36 INFO scheduler.DAGScheduler: Job 0 failed: 
 saveAsNewAPIHadoopFile at AbstractInputHelper.scala:103, took 2.480175 s
 15/04/03 03:15:36 ERROR yarn.ApplicationMaster: User class threw exception: 
 Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 0) had a not 
 serializable result: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
 - object not serializable (class: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value: 
 {userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt: 
 null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
 - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
 - object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0, 
 spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt: 
 null, currPsLvlId: null}))
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 
 in stage 2.0 (TID 0) had a not serializable result: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
 - object not serializable (class: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value: 
 {userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt: 
 null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
 - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
 - object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0, 
 spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt: 
 null, currPsLvlId: null}))
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at scala.Option.foreach(Option.scala:236)
 at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 15/04/03 03:15:36 INFO yarn.ApplicationMaster: Final app status: FAILED, 
 exitCode: 15, (reason: User class threw exception: Job aborted due to stage 
 failure: Task 0.0 in stage 2.0 (TID 0) had a not serializable result: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
 
 
 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum is auto 
 generated through avro schema using avro-generate-sources maven plugin.
 
 
 package com.ebay.ep.poc.spark.reporting.process.model.dw;  
 
 @SuppressWarnings("all")
 
 @org.apache.avro.specific.AvroGenerated
 
 public class SpsLevelMetricSum extends 
 org.apache.avro.specific.SpecificRecordBase implements 
 org.apache.avro.specific.SpecificRecord {
 
 ...
 ...
 }
 
 Can anyone suggest how to fix this ?
 
 
 
 -- 
 Deepak
 


Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Akhil Das
Because it's throwing serialization exceptions, and Kryo is a serializer
that can serialize your objects.
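
(A minimal sketch of switching a job to Kryo -- registerKryoClasses assumes
Spark 1.2+, and for Avro SpecificRecords an Avro-aware Kryo serializer such
as chill-avro may additionally be needed:)

import org.apache.spark.SparkConf

// Kryo can serialize classes that do not implement java.io.Serializable,
// which is the problem with the Avro-generated SpsLevelMetricSum here.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // optional, but avoids writing full class names into every serialized record
  .registerKryoClasses(Array(
    classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))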

Thanks
Best Regards

On Fri, Apr 3, 2015 at 5:37 PM, Deepak Jain deepuj...@gmail.com wrote:

 I meant that I did not have to use kyro. Why will kyro help fix this issue
 now ?

 Sent from my iPhone

 On 03-Apr-2015, at 5:36 pm, Deepak Jain deepuj...@gmail.com wrote:

 I was able to write record that extends specificrecord (avro) this class
 was not auto generated. Do we need to do something extra for auto generated
 classes

 Sent from my iPhone

 On 03-Apr-2015, at 5:06 pm, Akhil Das ak...@sigmoidanalytics.com wrote:

 This thread might give you some insights
 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3CCA+WVT8WXbEHac=N0GWxj-s9gqOkgG0VRL5B=ovjwexqm8ev...@mail.gmail.com%3E

 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 3:53 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 My Spark Job failed with


 15/04/03 03:15:36 INFO scheduler.DAGScheduler: Job 0 failed:
 saveAsNewAPIHadoopFile at AbstractInputHelper.scala:103, took 2.480175 s
 15/04/03 03:15:36 ERROR yarn.ApplicationMaster: User class threw
 exception: Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 0)
 had a not serializable result:
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
 - object not serializable (class:
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value:
 {userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt:
 null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
 - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
 - object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0,
 spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt:
 null, currPsLvlId: null}))
 org.apache.spark.SparkException: Job aborted due to stage failure: Task
 0.0 in stage 2.0 (TID 0) had a not serializable result:
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
 - object not serializable (class:
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value:
 {userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt:
 null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
 - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
 - object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0,
 spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt:
 null, currPsLvlId: null}))
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 15/04/03 03:15:36 INFO yarn.ApplicationMaster: Final app status: FAILED,
 exitCode: 15, (reason: User class threw exception: Job aborted due to stage
 failure: Task 0.0 in stage 2.0 (TID 0) had a not serializable result:
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:

 

 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum is
 auto generated through avro schema using avro-generate-sources maven plugin.


 package com.ebay.ep.poc.spark.reporting.process.model.dw;

 @SuppressWarnings("all")

 @org.apache.avro.specific.AvroGenerated

 public class SpsLevelMetricSum extends
 org.apache.avro.specific.SpecificRecordBase implements
 org.apache.avro.specific.SpecificRecord {
 ...
 ...
 }

 Can anyone suggest how to fix this ?



 --
 Deepak





RE: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread pawan kumar
Thanks Mohammed. Will give it a try today. We would also need the Spark SQL
piece as we are migrating our data store from Oracle to C* and it would be
easier to maintain all the reports rather than recreating each one from scratch.

Thanks,
Pawan Venugopal.
On Apr 3, 2015 7:59 AM, Mohammed Guller moham...@glassbeam.com wrote:

  Hi Todd,



 We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly
 with C* using the ODBC driver, but now would like to add Spark SQL to the
 mix. I haven’t been able to find any documentation for how to make this
 combination work.



 We are using the Spark-Cassandra-Connector in our applications, but
 haven’t been able to figure out how to get the Spark SQL Thrift Server to
 use it and connect to C*. That is the missing piece. Once we solve that
 piece of the puzzle then Tableau should be able to see the tables in C*.



 Hi Pawan,

 Tableau + C* is pretty straight forward, especially if you are using DSE.
 Create a new DSN in Tableau using the ODBC driver that comes with DSE. Once
 you connect, Tableau allows to use C* keyspace as schema and column
 families as tables.



 Mohammed



 *From:* pawan kumar [mailto:pkv...@gmail.com]
 *Sent:* Friday, April 3, 2015 7:41 AM
 *To:* Todd Nist
 *Cc:* user@spark.apache.org; Mohammed Guller
 *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra



 Hi Todd,

 Thanks for the link. I would be interested in this solution. I am using
 DSE for cassandra. Would you provide me with info on connecting with DSE
 either through Tableau or zeppelin. The goal here is query cassandra
 through spark sql so that I could perform joins and groupby on my queries.
 Are you able to perform spark sql queries with tableau?

 Thanks,
 Pawan Venugopal

 On Apr 3, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote:

 What version of Cassandra are you using?  Are you using DSE or the stock
 Apache Cassandra version?  I have connected it with DSE, but have not
 attempted it with the standard Apache Cassandra version.



 FWIW,
 http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
 provides an ODBC driver for accessing C* from Tableau.  Granted it does not
 provide all the goodness of Spark.  Are you attempting to leverage the
 spark-cassandra-connector for this?







 On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller moham...@glassbeam.com
 wrote:

 Hi –



 Is anybody using Tableau to analyze data in Cassandra through the Spark
 SQL Thrift Server?



 Thanks!



 Mohammed







Re: How to get a top X percent of a distribution represented as RDD

2015-04-03 Thread Debasish Das
Cool !

You should also consider contributing it back to Spark if you are doing
quantile calculations, for example...there is also a topByKey API added in
master by @coderxiang...see if you can use that API to make the code
clean
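
(A rough sketch of what that API could look like in use, if it is available
in your Spark build -- the names here are hypothetical:)

import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
import org.apache.spark.rdd.RDD

// keep the k largest scores for each segment, where k comes from the
// 10% bound computed earlier
def topPerSegment(scores: RDD[(String, Double)], k: Int): RDD[(String, Array[Double])] =
  scores.topByKey(k)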
On Apr 3, 2015 5:20 AM, Aung Htet aung@gmail.com wrote:

 Hi Debasish, Charles,

 I solved the problem by using a BPQ like method, based on your
 suggestions. So thanks very much for that!

 My approach was
 1) Count the population of each segment in the RDD by map/reduce so that I
 get the bound number N equivalent to 10% of each segment. This becomes the
 size of the BPQ.
 2) Associate the bounds N to the corresponding records in the first RDD.
 3) Reduce the RDD from step 2 by merging the values in every two rows,
 basically creating a sorted list (Indexed Seq)
 4) If the size of the sorted list is greater than N (the bound) then,
 create a new sorted list by using a priority queue and dequeuing top N
 values.

 In the end, I get a record for each segment with N max values for each
 segment.

 Regards,
 Aung








 On Fri, Mar 27, 2015 at 4:27 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 In that case you can directly use count-min-sketch from algebirdthey
 work fine with Spark aggregateBy but I have found the java BPQ from Spark
 much faster than say algebird Heap datastructure...

 On Thu, Mar 26, 2015 at 10:04 PM, Charles Hayden 
 charles.hay...@atigeo.com wrote:

  ​You could also consider using a count-min data structure such as in
 https://github.com/laserson/dsq​

 to get approximate quantiles, then use whatever values you want to
 filter the original sequence.
  --
 *From:* Debasish Das debasish.da...@gmail.com
 *Sent:* Thursday, March 26, 2015 9:45 PM
 *To:* Aung Htet
 *Cc:* user
 *Subject:* Re: How to get a top X percent of a distribution represented
 as RDD

  Idea is to use a heap and get topK elements from every
 partition...then use aggregateBy and for combOp do a merge routine from
 mergeSort...basically get 100 items from partition 1, 100 items from
 partition 2, merge them so that you get sorted 200 items and take 100...for
 merge you can use heap as well...Matei had a BPQ inside Spark which we use
 all the time...Passing arrays over wire is better than passing full heap
 objects and merge routine on array should run faster but needs experiment...

 On Thu, Mar 26, 2015 at 9:26 PM, Aung Htet aung@gmail.com wrote:

 Hi Debasish,

 Thanks for your suggestions. In-memory version is quite useful. I do
 not quite understand how you can use aggregateBy to get 10% top K elements.
 Can you please give an example?

 Thanks,
 Aung

 On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das debasish.da...@gmail.com
  wrote:

 You can do it in-memory as wellget 10% topK elements from each
 partition and use merge from any sort algorithm like timsortbasically
 aggregateBy

  Your version uses shuffle but this version is 0 shuffle..assuming
 your data set is cached you will be using in-memory allReduce through
 treeAggregate...

  But this is only good for top 10% or bottom 10%...if you need to do
 it for top 30% then may be the shuffle version will work better...

 On Thu, Mar 26, 2015 at 8:31 PM, Aung Htet aung@gmail.com wrote:

 Hi all,

  I have a distribution represented as an RDD of tuples, in rows of
 (segment, score)
 For each segment, I want to discard tuples with top X percent scores.
 This seems hard to do in Spark RDD.

  A naive algorithm would be -

  1) Sort RDD by segment & score (descending)
 2) Within each segment, number the rows from top to bottom.
 3) For each  segment, calculate the cut off index. i.e. 90 for 10%
 cut off out of a segment with 100 rows.
 4) For the entire RDD, filter rows with row num <= cut off index

 This does not look like a good algorithm. I would really appreciate
 if someone can suggest a better way to implement this in Spark.

  Regards,
 Aung









Re: Which OS for Spark cluster nodes?

2015-04-03 Thread Charles Feduke
As Akhil says Ubuntu is a good choice if you're starting from near scratch.

Cloudera CDH virtual machine images[1] include Hadoop, HDFS, Spark, and
other big data tools so you can get a cluster running with very little
effort. Keep in mind Cloudera is a for-profit corporation so they are also
selling a product.

Personally I prefer the EC2 scripts[2] that ship with the downloadable
Spark distribution. It provisions a cluster for you on AWS and you can
easily terminate the cluster when you don't need it. Ganglia (monitoring),
HDFS (ephemeral and EBS backed), Tachyon (caching), and Spark are all
installed automatically. For learning, using a cluster of 4 medium machines
is fairly inexpensive. (I use the EC2 scripts for both an integration and
production environment.)

1.
http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html
2. https://spark.apache.org/docs/latest/ec2-scripts.html

On Fri, Apr 3, 2015 at 7:38 AM Akhil Das ak...@sigmoidanalytics.com wrote:

 There isn't any specific Linux distro, but I would prefer Ubuntu for a
 beginner as it's very easy to apt-get install stuff on it.

 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 4:58 PM, Horsmann, Tobias 
 tobias.horsm...@uni-due.de wrote:

  Hi,
 Are there any recommendations for operating systems that one should use
 for setting up Spark/Hadoop nodes in general?
 I am not familiar with the differences between the various Linux
 distributions or how well they are (not) suited for cluster set-ups, so I
 wondered if there are some preferred choices?

  Regards,





Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread pawan kumar
Hi Todd,

Thanks for the link. I would be interested in this solution. I am using DSE
for Cassandra. Would you provide me with info on connecting with DSE either
through Tableau or Zeppelin? The goal here is to query Cassandra through Spark
SQL so that I can perform joins and group-bys in my queries. Are you able
to perform Spark SQL queries with Tableau?

Thanks,
Pawan Venugopal
On Apr 3, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote:

 What version of Cassandra are you using?  Are you using DSE or the stock
 Apache Cassandra version?  I have connected it with DSE, but have not
 attempted it with the standard Apache Cassandra version.

 FWIW,
 http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
 provides an ODBC driver for accessing C* from Tableau.  Granted it does not
 provide all the goodness of Spark.  Are you attempting to leverage the
 spark-cassandra-connector for this?



 On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller moham...@glassbeam.com
 wrote:

  Hi –



 Is anybody using Tableau to analyze data in Cassandra through the Spark
 SQL Thrift Server?



 Thanks!



 Mohammed







RE: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Mohammed Guller
Hi Todd,

We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly with C* 
using the ODBC driver, but now would like to add Spark SQL to the mix. I 
haven’t been able to find any documentation for how to make this combination 
work.

We are using the Spark-Cassandra-Connector in our applications, but haven’t 
been able to figure out how to get the Spark SQL Thrift Server to use it and 
connect to C*. That is the missing piece. Once we solve that piece of the 
puzzle then Tableau should be able to see the tables in C*.

Hi Pawan,
Tableau + C* is pretty straightforward, especially if you are using DSE.
Create a new DSN in Tableau using the ODBC driver that comes with DSE. Once you
connect, Tableau allows you to use the C* keyspace as a schema and column families as
tables.

Mohammed

From: pawan kumar [mailto:pkv...@gmail.com]
Sent: Friday, April 3, 2015 7:41 AM
To: Todd Nist
Cc: user@spark.apache.org; Mohammed Guller
Subject: Re: Tableau + Spark SQL Thrift Server + Cassandra


Hi Todd,

Thanks for the link. I would be interested in this solution. I am using DSE for 
cassandra. Would you provide me with info on connecting with DSE either through 
Tableau or zeppelin. The goal here is query cassandra through spark sql so that 
I could perform joins and groupby on my queries. Are you able to perform spark 
sql queries with tableau?

Thanks,
Pawan Venugopal
On Apr 3, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote:
What version of Cassandra are you using?  Are you using DSE or the stock Apache 
Cassandra version?  I have connected it with DSE, but have not attempted it 
with the standard Apache Cassandra version.

FWIW, 
http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
 provides an ODBC driver for accessing C* from Tableau.  Granted it does not 
provide all the goodness of Spark.  Are you attempting to leverage the 
spark-cassandra-connector for this?



On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller moham...@glassbeam.com wrote:
Hi –

Is anybody using Tableau to analyze data in Cassandra through the Spark SQL 
Thrift Server?

Thanks!

Mohammed




Re: Cannot run the example in the Spark 1.3.0 following the document

2015-04-03 Thread Sean Owen
(That one was already fixed last week, and so should be updated when
the site updates for 1.3.1.)

On Fri, Apr 3, 2015 at 4:59 AM, Michael Armbrust mich...@databricks.com wrote:
 Looks like a typo, try:

 df.select(df("name"), df("age") + 1)

 Or

 df.select("name", "age")

 PRs to fix docs are always appreciated :)

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: maven compile error

2015-04-03 Thread Sean Owen
If you're asking about a compile error, you should include the command
you used to compile.

I am able to compile branch 1.2 successfully with mvn -DskipTests
clean package.

This error is actually an error from scalac, not a compile error from
the code. It sort of sounds like it has not been able to download
scala dependencies. Check or maybe recreate your environment.

On Fri, Apr 3, 2015 at 3:19 AM, myelinji myeli...@aliyun.com wrote:
 Hi all:
   Just now I checked out spark-1.2 on GitHub and wanted to build it with Maven;
 however, I encountered an error during compiling:

 [INFO]
 
 [ERROR] Failed to execute goal
 net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on
 project spark-catalyst_2.10: wrap:
 scala.reflect.internal.MissingRequirementError: object scala.runtime in
 compiler mirror not found. - [Help 1]
 org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
 goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile
 (scala-compile-first) on project spark-catalyst_2.10: wrap:
 scala.reflect.internal.MissingRequirementError: object scala.runtime in
 compiler mirror not found.
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
 at
 org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
 at
 org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
 at
 org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
 at
 org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
 at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
 at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
 at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
 at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
 at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
 at
 org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
 at
 org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
 at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
 Caused by: org.apache.maven.plugin.MojoExecutionException: wrap:
 scala.reflect.internal.MissingRequirementError: object scala.runtime in
 compiler mirror not found.
 at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:490)
 at
 org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:101)
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209)
 ... 19 more
 Caused by: scala.reflect.internal.MissingRequirementError: object
 scala.runtime in compiler mirror not found.
 at
 scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
 at
 scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
 at
 scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
 at
 scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
 at
 scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
 at scala.reflect.internal.Mirrors$RootsBase.getPackage(Mirrors.scala:172)
 at
 scala.reflect.internal.Mirrors$RootsBase.getRequiredPackage(Mirrors.scala:175)
 at
 scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage$lzycompute(Definitions.scala:183)
 at
 scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage(Definitions.scala:183)
 at
 scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass$lzycompute(Definitions.scala:184)
 at
 scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass(Definitions.scala:184)
 at
 scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr$lzycompute(Definitions.scala:1024)
 at
 scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr(Definitions.scala:1023)
 at
 scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses$lzycompute(Definitions.scala:1153)
 at
 scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses(Definitions.scala:1152)
 at
 scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode$lzycompute(Definitions.scala:1196)
 at
 

Spark Job Failed - Class not serializable

2015-04-03 Thread ๏̯͡๏
My Spark Job failed with


15/04/03 03:15:36 INFO scheduler.DAGScheduler: Job 0 failed:
saveAsNewAPIHadoopFile at AbstractInputHelper.scala:103, took 2.480175 s
15/04/03 03:15:36 ERROR yarn.ApplicationMaster: User class threw exception:
Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 0) had a not
serializable result:
com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
Serialization stack:
- object not serializable (class:
com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value:
{userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt:
null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
- field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
- object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0,
spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt:
null, currPsLvlId: null}))
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0
in stage 2.0 (TID 0) had a not serializable result:
com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
Serialization stack:
- object not serializable (class:
com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value:
{userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt:
null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
- field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
- object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0,
spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt:
null, currPsLvlId: null}))
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
15/04/03 03:15:36 INFO yarn.ApplicationMaster: Final app status: FAILED,
exitCode: 15, (reason: User class threw exception: Job aborted due to stage
failure: Task 0.0 in stage 2.0 (TID 0) had a not serializable result:
com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
Serialization stack:



com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum is auto
generated through avro schema using avro-generate-sources maven plugin.


package com.ebay.ep.poc.spark.reporting.process.model.dw;

@SuppressWarnings("all")

@org.apache.avro.specific.AvroGenerated

public class SpsLevelMetricSum extends
org.apache.avro.specific.SpecificRecordBase implements
org.apache.avro.specific.SpecificRecord {
...
...
}

Can anyone suggest how to fix this ?



-- 
Deepak


Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread ๏̯͡๏
Akhil
you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar.
How did you get this lib into the spark/lib folder?
1) Did you place it there?
2) What is the download location?


On Fri, Apr 3, 2015 at 3:42 PM, Todd Nist tsind...@gmail.com wrote:

 Started the spark shell with the one jar from hive suggested:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 
 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar 
 --jars /opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar

 Results in the same error:

 scala> sql("""
      | SELECT path, name, value, v1.peValue, v1.peName
      |  FROM metric_table
      |    lateral view json_tuple(pathElements, 'name', 'value') v1
      |  as peName, peValue
      | """)
 15/04/03 06:01:30 INFO ParseDriver: Parsing command: SELECT path, name,
 value, v1.peValue, v1.peName FROM metric_table   lateral view
 json_tuple(pathElements, 'name', 'value') v1 as peName, peValue
 15/04/03 06:01:31 INFO ParseDriver: Parse Completed
 res2: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[5] at RDD at SchemaRDD.scala:108
 == Query Plan ==
 == Physical Plan ==
 java.lang.ClassNotFoundException: json_tuple

 I will try the rebuild.  Thanks again for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 5:34 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Can you try building Spark
  https://spark.apache.org/docs/1.2.0/building-spark.html#building-with-hive-and-jdbc-support
 with hive support? Before that try to run the following:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores
 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-
 bin.jar --jars /opt/hive/0.13.1/lib/hive-exec.jar

 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

 This is for version 1.2.1.  Well the other thread that you reference was
 me attempting it in 1.3.0 to see if the issue was related to 1.2.1.  I did
 not build Spark but used the version from the Spark download site for 1.2.1
 Pre Built for Hadoop 2.4 or Later.

 Since I get the error in both 1.2.1 and 1.3.0,

 15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in thread
 main java.lang.ClassNotFoundException: json_tuple at
 java.net.URLClassLoader$1.run(

 It looks like I just don't have the jar.  Even including all jars in the
 $HIVE/lib directory did not seem to work.  Though when looking in $HIVE/lib
 for 0.13.1, I do not see any json serde or jackson files.  I do see that
 hive-exec.jar contains
 the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
 you know if there is another Jar that is required or should it work just by
 including all jars from $HIVE/lib?

 I can build it locally, but did not think that was required based on the
 version I downloaded; is that not the case?

 Thanks for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 How did you build spark? which version of spark are you having? Doesn't
 this thread already explains it?
 https://www.mail-archive.com/user@spark.apache.org/msg25505.html

 Thanks
 Best Regards

 On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Akhil,

  Tried your suggestion to no avail.  I actually do not see any
 jackson or json serde jars in the $HIVE/lib directory.  This is hive
 0.13.1 and spark 1.2.1

 Here is what I did:

  I have added the lib folder to the --jars option when starting the
 spark-shell,
 but the job fails. The hive-site.xml is in the $SPARK_HOME/conf
 directory.

 I start the spark-shell as follows:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 
 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

 and like this

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 
 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
 /opt/hive/0.13.1/lib/*

 I’m just doing this in the spark-shell now:

 import org.apache.spark.sql.hive._
 val sqlContext = new HiveContext(sc)
 import sqlContext._
 case class MetricTable(path: String, pathElements: String, name: String, value: String)
 val mt = new MetricTable(path = "/DC1/HOST1/",
   pathElements = """[{"node": "DataCenter","value": "DC1"},{"node": "host","value": "HOST1"}]""",
   name = "Memory Usage (%)",
   value = "29.590943279257175")
 val rdd1 = sc.makeRDD(List(mt))
 rdd1.printSchema()
 rdd1.registerTempTable("metric_table")
 sql("""
 SELECT path, name, value, v1.peValue, v1.peName
  FROM metric_table
    lateral view json_tuple(pathElements, 'name', 'value') v1
  as peName, peValue
 """).collect.foreach(println(_))

 It results in the same error:

 15/04/02 12:33:59 INFO ParseDriver: Parsing command: SELECT path, name, 
 value, v1.peValue, v1.peName FROM metric_table   

Re: maven compile error

2015-04-03 Thread myelinji
Thank you for your reply. When I use Maven to compile the whole project,
the errors are as follows:
[INFO] Spark Project Parent POM .. SUCCESS [4.136s]
[INFO] Spark Project Networking .. SUCCESS [7.405s]
[INFO] Spark Project Shuffle Streaming Service ... SUCCESS [5.071s]
[INFO] Spark Project Core  SUCCESS [3:08.445s]
[INFO] Spark Project Bagel ... SUCCESS [21.613s]
[INFO] Spark Project GraphX .. SUCCESS [58.915s]
[INFO] Spark Project Streaming ... SUCCESS [1:26.202s]
[INFO] Spark Project Catalyst  FAILURE [1.537s]
[INFO] Spark Project SQL . SKIPPED
[INFO] Spark Project ML Library .. SKIPPED
[INFO] Spark Project Tools ... SKIPPED
[INFO] Spark Project Hive  SKIPPED
[INFO] Spark Project REPL  SKIPPED
[INFO] Spark Project Assembly  SKIPPED
[INFO] Spark Project External Twitter  SKIPPED
[INFO] Spark Project External Flume Sink . SKIPPED
[INFO] Spark Project External Flume .. SKIPPED
[INFO] Spark Project External MQTT ... SKIPPED
[INFO] Spark Project External ZeroMQ . SKIPPED
[INFO] Spark Project External Kafka .. SKIPPED
[INFO] Spark Project Examples  SKIPPED
It seems like there is something wrong with the catalyst project. Why can't I
compile this project?

------------------------------------------------------------------
From: Sean Owen so...@cloudera.com
Sent: Friday, April 3, 2015 17:48
To: myelinji myeli...@aliyun.com
Cc: Spark user list user@spark.apache.org
Subject: Re: maven compile error
If you're asking about a compile error, you should include the command
you used to compile.

I am able to compile branch 1.2 successfully with mvn -DskipTests
clean package.

This error is actually an error from scalac, not a compile error from
the code. It sort of sounds like it has not been able to download
scala dependencies. Check or maybe recreate your environment.

On Fri, Apr 3, 2015 at 3:19 AM, myelinji myeli...@aliyun.com wrote:
 Hi,all:
Just now i checked out spark-1.2 on github , wanna to build it use maven,
 how ever I encountered an error during compiling:

 [INFO]
 
 [ERROR] Failed to execute goal
 net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on
 project spark-catalyst_2.10: wrap:
 scala.reflect.internal.MissingRequirementError: object scala.runtime in
 compiler mirror not found. - [Help 1]
 org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
 goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile
 (scala-compile-first) on project spark-catalyst_2.10: wrap:
 scala.reflect.internal.MissingRequirementError: object scala.runtime in
 compiler mirror not found.
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
 at
 org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
 at
 org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
 at
 org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
 at
 org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
 at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
 at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
 at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
 at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
 at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
 at
 org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
 at
 org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
 at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
 Caused by: org.apache.maven.plugin.MojoExecutionException: wrap:
 scala.reflect.internal.MissingRequirementError: object scala.runtime in
 compiler mirror not found.
 at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:490)
 at
 

Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
Hi Mohammed,

Not sure if you have tried this or not.  You could try using the below api
to start the thriftserver with an existing context.

https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42

The one thing that Michael Armbrust @ Databricks recommended was this:

 You can start a JDBC server with an existing context.  See my answer here:
 http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html

So something like this based on example from Cheng Lian:

*Server*

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.catalyst.types._

val sparkContext = sc
import sparkContext._
val sqlContext = new HiveContext(sparkContext)
import sqlContext._
makeRDD((1, "hello") :: (2, "world") :: Nil).toSchemaRDD.cache().registerTempTable("t")
// replace the above with the C* + spark-cassandra-connector code to
// generate the SchemaRDD and registerTempTable

import org.apache.spark.sql.hive.thriftserver._
HiveThriftServer2.startWithContext(sqlContext)

Then Startup

./bin/beeline -u jdbc:hive2://localhost:10000/default
0: jdbc:hive2://localhost:10000/default> select * from t;


I have not tried this yet from Tableau.   My understanding is that the
tempTable is only valid as long as the sqlContext is, so if one terminates
the code representing the *Server*, and then restarts the standard thrift
server, sbin/start-thriftserver ..., the table won't be available.

Another possibility is to perhaps use the tuplejump cash project,
https://github.com/tuplejump/cash.

HTH.

-Todd

On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar pkv...@gmail.com wrote:

 Thanks mohammed. Will give it a try today. We would also need the
 sparksSQL piece as we are migrating our data store from oracle to C* and it
 would be easier to maintain all the reports rather recreating each one from
 scratch.

 Thanks,
 Pawan Venugopal.
 On Apr 3, 2015 7:59 AM, Mohammed Guller moham...@glassbeam.com wrote:

  Hi Todd,



 We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly
 with C* using the ODBC driver, but now would like to add Spark SQL to the
 mix. I haven’t been able to find any documentation for how to make this
 combination work.



 We are using the Spark-Cassandra-Connector in our applications, but
 haven’t been able to figure out how to get the Spark SQL Thrift Server to
 use it and connect to C*. That is the missing piece. Once we solve that
 piece of the puzzle then Tableau should be able to see the tables in C*.



 Hi Pawan,

 Tableau + C* is pretty straight forward, especially if you are using DSE.
 Create a new DSN in Tableau using the ODBC driver that comes with DSE. Once
 you connect, Tableau allows to use C* keyspace as schema and column
 families as tables.



 Mohammed



 *From:* pawan kumar [mailto:pkv...@gmail.com]
 *Sent:* Friday, April 3, 2015 7:41 AM
 *To:* Todd Nist
 *Cc:* user@spark.apache.org; Mohammed Guller
 *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra



 Hi Todd,

 Thanks for the link. I would be interested in this solution. I am using
 DSE for cassandra. Would you provide me with info on connecting with DSE
 either through Tableau or zeppelin. The goal here is query cassandra
 through spark sql so that I could perform joins and groupby on my queries.
 Are you able to perform spark sql queries with tableau?

 Thanks,
 Pawan Venugopal

 On Apr 3, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote:

 What version of Cassandra are you using?  Are you using DSE or the stock
 Apache Cassandra version?  I have connected it with DSE, but have not
 attempted it with the standard Apache Cassandra version.



 FWIW,
 http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
 provides an ODBC driver for accessing C* from Tableau.  Granted it does not
 provide all the goodness of Spark.  Are you attempting to leverage the
 spark-cassandra-connector for this?







 On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller moham...@glassbeam.com
 wrote:

 Hi –



 Is anybody using Tableau to analyze data in Cassandra through the Spark
 SQL Thrift Server?



 Thanks!



 Mohammed








Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
@Pawan

Not sure if you have seen this or not, but here is a good example by
Jonathan Lacefield of Datastax on hooking up Spark SQL with DSE; adding
Tableau is as simple as Mohammed stated with DSE.
https://github.com/jlacefie/sparksqltest.

HTH,
Todd

On Fri, Apr 3, 2015 at 2:39 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Mohammed,

 Not sure if you have tried this or not.  You could try using the below api
 to start the thriftserver with an existing context.


 https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42

 The one thing that Michael Armbrust @ Databricks recommended was this:

 You can start a JDBC server with an existing context.  See my answer
 here:
 http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html

 So something like this based on example from Cheng Lian:

 *Server*

 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.catalyst.types._

 val sparkContext = sc
 import sparkContext._
 val sqlContext = new HiveContext(sparkContext)
 import sqlContext._
 makeRDD((1, "hello") :: (2, "world") :: Nil).toSchemaRDD.cache().registerTempTable("t")
 // replace the above with the C* + spark-cassandra-connector code to
 // generate the SchemaRDD and registerTempTable

 import org.apache.spark.sql.hive.thriftserver._
 HiveThriftServer2.startWithContext(sqlContext)

 Then Startup

 ./bin/beeline -u jdbc:hive2://localhost:10000/default
 0: jdbc:hive2://localhost:10000/default> select * from t;


 I have not tried this yet from Tableau.   My understanding is that the
 tempTable is only valid as long as the sqlContext is, so if one terminates
 the code representing the *Server*, and then restarts the standard thrift
 server, sbin/start-thriftserver ..., the table won't be available.
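 For what it's worth, a rough sketch of the C* piece (the part the comment in
 the snippet above says to replace) might look like the following. This is
 purely illustrative: the keyspace, table and column names are made up, and it
 assumes the spark-cassandra-connector jar is on the classpath and
 spark.cassandra.connection.host is set in the SparkConf.

 import com.datastax.spark.connector._
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.hive.thriftserver._

 case class User(id: Int, name: String)  // mirrors a hypothetical table ks.users

 val sqlContext = new HiveContext(sc)
 import sqlContext._  // brings the implicit RDD-to-SchemaRDD conversion into scope

 // Read ks.users from C*, map each CassandraRow onto the case class, and
 // register the result as a temp table visible to the Thrift server.
 val users = sc.cassandraTable("ks", "users")
   .map(row => User(row.getInt("id"), row.getString("name")))
 users.toSchemaRDD.cache().registerTempTable("users")

 HiveThriftServer2.startWithContext(sqlContext)

 After that, a select * from users should work from beeline (and, in principle,
 from anything speaking the HiveServer2 JDBC/ODBC protocol), subject to the
 same sqlContext-lifetime caveat described above.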

 Another possibility is to perhaps use the tuplejump cash project,
 https://github.com/tuplejump/cash.

 HTH.

 -Todd

 On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar pkv...@gmail.com wrote:

 Thanks Mohammed. Will give it a try today. We would also need the
 Spark SQL piece as we are migrating our data store from Oracle to C*, and it
 would be easier to maintain all the reports rather than recreating each one from
 scratch.

 Thanks,
 Pawan Venugopal.
 On Apr 3, 2015 7:59 AM, Mohammed Guller moham...@glassbeam.com wrote:

  Hi Todd,



 We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly
 with C* using the ODBC driver, but now would like to add Spark SQL to the
 mix. I haven’t been able to find any documentation for how to make this
 combination work.



 We are using the Spark-Cassandra-Connector in our applications, but
 haven’t been able to figure out how to get the Spark SQL Thrift Server to
 use it and connect to C*. That is the missing piece. Once we solve that
 piece of the puzzle then Tableau should be able to see the tables in C*.



 Hi Pawan,

 Tableau + C* is pretty straightforward, especially if you are using
 DSE. Create a new DSN in Tableau using the ODBC driver that comes with DSE.
 Once you connect, Tableau allows you to use a C* keyspace as a schema and column
 families as tables.



 Mohammed



 *From:* pawan kumar [mailto:pkv...@gmail.com]
 *Sent:* Friday, April 3, 2015 7:41 AM
 *To:* Todd Nist
 *Cc:* user@spark.apache.org; Mohammed Guller
 *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra



 Hi Todd,

 Thanks for the link. I would be interested in this solution. I am using
 DSE for Cassandra. Would you provide me with info on connecting with DSE
 either through Tableau or Zeppelin? The goal here is to query Cassandra
 through Spark SQL so that I could perform joins and groupBy on my queries.
 Are you able to perform Spark SQL queries with Tableau?

 Thanks,
 Pawan Venugopal

 On Apr 3, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote:

 What version of Cassandra are you using?  Are you using DSE or the stock
 Apache Cassandra version?  I have connected it with DSE, but have not
 attempted it with the standard Apache Cassandra version.



 FWIW,
 http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
 provides an ODBC driver for accessing C* from Tableau.  Granted it does not
 provide all the goodness of Spark.  Are you attempting to leverage the
 spark-cassandra-connector for this?







 On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller moham...@glassbeam.com
 wrote:

 Hi –



 Is anybody using Tableau to analyze data in Cassandra through the Spark
 SQL Thrift Server?



 Thanks!



 Mohammed









Re: Matei Zaharai: Reddit Ask Me Anything

2015-04-03 Thread ben lorica
Happening right now

   
https://www.reddit.com/r/IAmA/comments/31bkue/im_matei_zaharia_creator_of_spark_and_cto_at/



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Matei-Zaharai-Reddit-Ask-Me-Anything-tp22364p22369.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: About Waiting batches on the spark streaming UI

2015-04-03 Thread Tathagata Das
Very good question! This is because the current code is written such that
the UI considers a batch as waiting only when it has actually started being
processed. That is, batches waiting in the job queue are not considered in the
calculation. It is arguable that it may be more intuitive to count those in
the waiting as well.
On Apr 3, 2015 12:59 AM, bit1...@163.com bit1...@163.com wrote:


 I copied the following from the spark streaming UI, I don't know why the
 Waiting batches is 1, my understanding is that it should be 72.
 Following  is my understanding:
 1. Total time is 1minute 35 seconds=95 seconds
 2. Batch interval is 1 second, so, 95 batches are generated in 95 seconds.
 3. Processed batches are 23(Correct, because in my processing code, it
 does nothing but sleep 4 seconds)
 4. Then the waiting batches should be 95-23=72



- *Started at: * Fri Apr 03 15:17:47 CST 2015
- *Time since start: *1 minute 35 seconds
- *Network receivers: *1
- *Batch interval: *1 second
- *Processed batches: *23
- *Waiting batches: *1
- *Received records: *0
- *Processed records: *0


 --
 bit1...@163.com



MLlib: save models to HDFS?

2015-04-03 Thread S. Zhou
I am new to MLlib so I have a basic question: is it possible to save MLlib 
models (particularly CF models) to HDFS and then reload them later? If yes, could 
you share some sample code (I could not find it in the MLlib tutorial). Thanks!
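
A rough sketch, assuming Spark 1.3+ where MatrixFactorizationModel (the CF
model produced by ALS) and several other MLlib models implement save/load; the
HDFS paths, input format and ALS parameters below are made up:

import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

// Train a CF model from hypothetical "user,product,rating" lines.
val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
  val Array(user, product, rating) = line.split(',')
  Rating(user.toInt, product.toInt, rating.toDouble)
}
val model = ALS.train(ratings, 10, 10, 0.01)  // rank, iterations, lambda

// Persist the model to HDFS, then load it back later (e.g. in another job).
model.save(sc, "hdfs:///models/cf-model")
val sameModel = MatrixFactorizationModel.load(sc, "hdfs:///models/cf-model")

On older releases that lack save/load, one workaround is to persist the model's
user and product feature RDDs with saveAsObjectFile and rebuild the model by
hand, but that is version-specific and not shown here.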

Re: About Waiting batches on the spark streaming UI

2015-04-03 Thread Ted Yu
Maybe add another stat for batches waiting in the job queue ?

Cheers

On Fri, Apr 3, 2015 at 10:01 AM, Tathagata Das t...@databricks.com wrote:

 Very good question! This is because the current code is written such that
 the UI considers a batch as waiting only when it has actually started being
 processed. That is, batches waiting in the job queue are not considered in the
 calculation. It is arguable that it may be more intuitive to count those in
 the waiting as well.
 On Apr 3, 2015 12:59 AM, bit1...@163.com bit1...@163.com wrote:


 I copied the following from the spark streaming UI, I don't know why the
 Waiting batches is 1, my understanding is that it should be 72.
 Following  is my understanding:
 1. Total time is 1minute 35 seconds=95 seconds
 2. Batch interval is 1 second, so, 95 batches are generated in 95 seconds.
 3. Processed batches are 23(Correct, because in my processing code, it
 does nothing but sleep 4 seconds)
 4. Then the waiting batches should be 95-23=72



- *Started at: * Fri Apr 03 15:17:47 CST 2015
- *Time since start: *1 minute 35 seconds
- *Network receivers: *1
- *Batch interval: *1 second
- *Processed batches: *23
- *Waiting batches: *1
- *Received records: *0
- *Processed records: *0


 --
 bit1...@163.com




Re: About Waiting batches on the spark streaming UI

2015-04-03 Thread Tathagata Das
Maybe that should be marked as waiting as well. We plan to update the UI
soon, so we will keep that in mind.
On Apr 3, 2015 10:12 AM, Ted Yu yuzhih...@gmail.com wrote:

 Maybe add another stat for batches waiting in the job queue ?

 Cheers

 On Fri, Apr 3, 2015 at 10:01 AM, Tathagata Das t...@databricks.com
 wrote:

 Very good question! This is because the current code is written such that
 the UI considers a batch as waiting only when it has actually started being
 processed. That is, batches waiting in the job queue are not considered in the
 calculation. It is arguable that it may be more intuitive to count those in
 the waiting as well.
 On Apr 3, 2015 12:59 AM, bit1...@163.com bit1...@163.com wrote:


 I copied the following from the spark streaming UI, I don't know why the
 Waiting batches is 1, my understanding is that it should be 72.
 Following  is my understanding:
 1. Total time is 1minute 35 seconds=95 seconds
 2. Batch interval is 1 second, so, 95 batches are generated in 95
 seconds.
 3. Processed batches are 23(Correct, because in my processing code, it
 does nothing but sleep 4 seconds)
 4. Then the waiting batches should be 95-23=72



- *Started at: * Fri Apr 03 15:17:47 CST 2015
- *Time since start: *1 minute 35 seconds
- *Network receivers: *1
- *Batch interval: *1 second
- *Processed batches: *23
- *Waiting batches: *1
- *Received records: *0
- *Processed records: *0


 --
 bit1...@163.com





Re: Spark + Kinesis

2015-04-03 Thread Kelly, Jonathan
spark-streaming-kinesis-asl is not part of the Spark distribution on your 
cluster, so you cannot have it be just a provided dependency.  This is also 
why the KCL and its dependencies were not included in the assembly (but yes, 
they should be).

~ Jonathan Kelly

From: Vadim Bichutskiy 
vadim.bichuts...@gmail.commailto:vadim.bichuts...@gmail.com
Date: Friday, April 3, 2015 at 12:26 PM
To: Jonathan Kelly jonat...@amazon.commailto:jonat...@amazon.com
Cc: user@spark.apache.orgmailto:user@spark.apache.org 
user@spark.apache.orgmailto:user@spark.apache.org
Subject: Re: Spark + Kinesis

Hi all,

Good news! I was able to create a Kinesis consumer and assemble it into an 
uber jar following 
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html and 
example 
https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala.

However when I try to spark-submit it I get the following exception:

Exception in thread main java.lang.NoClassDefFoundError: 
com/amazonaws/auth/AWSCredentialsProvider

Do I need to include KCL dependency in build.sbt, here's what it looks like 
currently:

import AssemblyKeys._
name := "Kinesis Consumer"
version := "1.0"
organization := "com.myconsumer"
scalaVersion := "2.11.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0" % "provided"

assemblySettings
jarName in assembly := "consumer-assembly.jar"
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

Any help appreciated.

Thanks,
Vadim


On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan 
jonat...@amazon.commailto:jonat...@amazon.com wrote:
It looks like you're attempting to mix Scala versions, so that's going to cause 
some problems.  If you really want to use Scala 2.11.5, you must also use Spark 
package versions built for Scala 2.11 rather than 2.10.  Anyway, that's not 
quite the correct way to specify Scala dependencies in build.sbt.  Instead of 
placing the Scala version after the artifactId (like "spark-core_2.10"), what 
you actually want is to use just "spark-core" with two percent signs before it. 
 Using two percent signs will make it use the version of Scala that matches 
your declared scalaVersion.  For example:

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"

libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

I think that may get you a little closer, though I think you're probably going 
to run into the same problems I ran into in this thread: 
https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I never 
really got an answer for that, and I temporarily moved on to other things for 
now.

~ Jonathan Kelly

From: 'Vadim Bichutskiy' 
vadim.bichuts...@gmail.commailto:vadim.bichuts...@gmail.com
Date: Thursday, April 2, 2015 at 9:53 AM
To: user@spark.apache.orgmailto:user@spark.apache.org 
user@spark.apache.orgmailto:user@spark.apache.org
Subject: Spark + Kinesis

Hi all,

I am trying to write an Amazon Kinesis consumer Scala app that processes data 
in the
Kinesis stream. Is this the correct way to specify build.sbt:

---
import AssemblyKeys._
name := "Kinesis Consumer"
version := "1.0"
organization := "com.myconsumer"
scalaVersion := "2.11.5"

libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
  "org.apache.spark" % "spark-streaming_2.10" % "1.3.0",
  "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")

assemblySettings
jarName in assembly := "consumer-assembly.jar"
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)


In project/assembly.sbt I have only the following line:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

I am using sbt 0.13.7. I adapted Example 7.7 in the Learning Spark book.

Thanks,
Vadim




Re: Spark + Kinesis

2015-04-03 Thread Vadim Bichutskiy
Removed "provided" and got the following error:

[error] (*:assembly) deduplicate: different file contents found in the
following:

[error]
/Users/vb/.ivy2/cache/com.esotericsoftware.kryo/kryo/bundles/kryo-2.21.jar:com/esotericsoftware/minlog/Log$Logger.class

[error]
/Users/vb/.ivy2/cache/com.esotericsoftware.minlog/minlog/jars/minlog-1.2.jar:com/esotericsoftware/minlog/Log$Logger.class

On Fri, Apr 3, 2015 at 3:48 PM, Tathagata Das t...@databricks.com wrote:

 Just remove "provided" for spark-streaming-kinesis-asl

 libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

 On Fri, Apr 3, 2015 at 12:45 PM, Vadim Bichutskiy 
 vadim.bichuts...@gmail.com wrote:

 Thanks. So how do I fix it?

 On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

   spark-streaming-kinesis-asl is not part of the Spark distribution on
 your cluster, so you cannot have it be just a provided dependency.  This
 is also why the KCL and its dependencies were not included in the assembly
 (but yes, they should be).


  ~ Jonathan Kelly

   From: Vadim Bichutskiy vadim.bichuts...@gmail.com
 Date: Friday, April 3, 2015 at 12:26 PM
 To: Jonathan Kelly jonat...@amazon.com
 Cc: user@spark.apache.org user@spark.apache.org
 Subject: Re: Spark + Kinesis

   Hi all,

  Good news! I was able to create a Kinesis consumer and assemble it
 into an uber jar following
 http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
 and example
 https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
 .

  However when I try to spark-submit it I get the following exception:

  *Exception in thread main java.lang.NoClassDefFoundError:
 com/amazonaws/auth/AWSCredentialsProvider*

  Do I need to include KCL dependency in *build.sbt*, here's what it
 looks like currently:

  import AssemblyKeys._
 name := Kinesis Consumer
 version := 1.0
 organization := com.myconsumer
 scalaVersion := 2.11.5

  libraryDependencies += org.apache.spark %% spark-core % 1.3.0 %
 provided
 libraryDependencies += org.apache.spark %% spark-streaming % 1.3.0
 % provided
 libraryDependencies += org.apache.spark %%
 spark-streaming-kinesis-asl % 1.3.0 % provided

  assemblySettings
 jarName in assembly :=  consumer-assembly.jar
 assemblyOption in assembly := (assemblyOption in
 assembly).value.copy(includeScala=false)

  Any help appreciated.

  Thanks,
 Vadim

 On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

  It looks like you're attempting to mix Scala versions, so that's
 going to cause some problems.  If you really want to use Scala 2.11.5, you
 must also use Spark package versions built for Scala 2.11 rather than
 2.10.  Anyway, that's not quite the correct way to specify Scala
 dependencies in build.sbt.  Instead of placing the Scala version after the
 artifactId (like spark-core_2.10), what you actually want is to use just
 spark-core with two percent signs before it.  Using two percent signs
 will make it use the version of Scala that matches your declared
 scalaVersion.  For example:

  libraryDependencies += org.apache.spark %% spark-core % 1.3.0 %
 provided

  libraryDependencies += org.apache.spark %% spark-streaming %
 1.3.0 % provided

  libraryDependencies += org.apache.spark %%
 spark-streaming-kinesis-asl % 1.3.0

  I think that may get you a little closer, though I think you're
 probably going to run into the same problems I ran into in this thread:
 https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I
 never really got an answer for that, and I temporarily moved on to other
 things for now.


  ~ Jonathan Kelly

   From: 'Vadim Bichutskiy' vadim.bichuts...@gmail.com
 Date: Thursday, April 2, 2015 at 9:53 AM
 To: user@spark.apache.org user@spark.apache.org
 Subject: Spark + Kinesis

   Hi all,

  I am trying to write an Amazon Kinesis consumer Scala app that
 processes data in the
 Kinesis stream. Is this the correct way to specify *build.sbt*:

  ---
 import AssemblyKeys._
 name := "Kinesis Consumer"
 version := "1.0"
 organization := "com.myconsumer"
 scalaVersion := "2.11.5"

 libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
   "org.apache.spark" % "spark-streaming_2.10" % "1.3.0",
   "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")

 assemblySettings
 jarName in assembly := "consumer-assembly.jar"
 assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
 

  In *project/assembly.sbt* I have only the following line:

  addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

  I am using sbt 0.13.7. I adapted Example 7.7 in the Learning Spark
 book.

  Thanks,
 Vadim







Re: spark mesos deployment : starting workers based on attributes

2015-04-03 Thread Ankur Chauhan

Hi,

Thanks! I'll add the JIRA. I'll also try to work on a patch this weekend.

-- Ankur Chauhan

On 03/04/2015 13:23, Tim Chen wrote:
 Hi Ankur,
 
 There isn't a way to do that yet, but it's simple to add.
 
 Can you create a JIRA in Spark for this?
 
 Thanks!
 
 Tim
 
 On Fri, Apr 3, 2015 at 1:08 PM, Ankur Chauhan
 achau...@brightcove.com mailto:achau...@brightcove.com wrote:
 
 Hi,
 
 I am trying to figure out if there is a way to tell the mesos 
 scheduler in spark to isolate the workers to a set of mesos slaves 
 that have a given attribute such as `tachyon:true`.
 
 Anyone know if that is possible or how I could achieve such a
 behavior?
 
 Thanks! -- Ankur Chauhan
 
 -

 
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 mailto:user-unsubscr...@spark.apache.org For additional commands,
 e-mail: user-h...@spark.apache.org 
 mailto:user-h...@spark.apache.org
 
 

-- Ankur Chauhan

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
@Pawan,

So it's been a couple of months since I have had a chance to do anything
with Zeppelin, but here is a link to a post on what I did to get it working
https://groups.google.com/forum/#!topic/zeppelin-developers/mCNdyOXNikI.
This may or may not work with the newer releases from Zeppelin.

-Todd

On Fri, Apr 3, 2015 at 3:02 PM, pawan kumar pkv...@gmail.com wrote:

 Hi Todd,

 Thanks for the help. So I was able to get DSE working with Tableau as
 per the link provided by Mohammed. Now I am trying to figure out if I could
 write Spark SQL queries from Tableau and get data from DSE. My end goal is
 to get a web-based tool where I could write SQL queries which will pull
 data from Cassandra.

 With Zeppelin I was able to build and run it in EC2, but I am not sure if the
 configurations are right. I am pointing to a Spark master which is a remote
 DSE node, and all Spark and Spark SQL dependencies are on the remote node. I
 am not sure if I need to install Spark and its dependencies on the web UI
 (Zeppelin) node.

 I am not sure whether talking about Zeppelin in this thread is right.

 Thanks once again for all the help.

 Thanks,
 Pawan Venugopal


 On Fri, Apr 3, 2015 at 11:48 AM, Todd Nist tsind...@gmail.com wrote:

 @Pawan

 Not sure if you have seen this or not, but here is a good example by
 Jonathan Lacefield of Datastax's on hooking up sparksql with DSE, adding
 Tableau is as simple as Mohammed stated with DSE.
 https://github.com/jlacefie/sparksqltest.

 HTH,
 Todd

 On Fri, Apr 3, 2015 at 2:39 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Mohammed,

 Not sure if you have tried this or not.  You could try using the below
 api to start the thriftserver with an existing context.


 https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42

 The one thing that Michael Armbrust @ Databricks recommended was this:

 You can start a JDBC server with an existing context.  See my answer
 here:
 http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html

 So something like this based on example from Cheng Lian:

 *Server*

 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.catalyst.types._

 val sparkContext = sc
 import sparkContext._
 val sqlContext = new HiveContext(sparkContext)
 import sqlContext._
 makeRDD((1, "hello") :: (2, "world") :: Nil).toSchemaRDD.cache().registerTempTable("t")
 // replace the above with the C* + spark-cassandra-connector to generate a
 // SchemaRDD and registerTempTable

 import org.apache.spark.sql.hive.thriftserver._
 HiveThriftServer2.startWithContext(sqlContext)

 Then Startup

 ./bin/beeline -u jdbc:hive2://localhost:1/default
 0: jdbc:hive2://localhost:1/default select * from t;


 I have not tried this yet from Tableau.   My understanding is that the
 tempTable is only valid as long as the sqlContext is, so if one terminates
 the code representing the *Server*, and then restarts the standard
 thrift server, sbin/start-thriftserver ..., the table won't be available.

 Another possibility is to perhaps use the tuplejump cash project,
 https://github.com/tuplejump/cash.

 HTH.

 -Todd

 On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar pkv...@gmail.com wrote:

 Thanks Mohammed. Will give it a try today. We would also need the
 Spark SQL piece as we are migrating our data store from Oracle to C*, and it
 would be easier to maintain all the reports rather than recreating each one from
 scratch.

 Thanks,
 Pawan Venugopal.
 On Apr 3, 2015 7:59 AM, Mohammed Guller moham...@glassbeam.com
 wrote:

  Hi Todd,



 We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly
 with C* using the ODBC driver, but now would like to add Spark SQL to the
 mix. I haven’t been able to find any documentation for how to make this
 combination work.



 We are using the Spark-Cassandra-Connector in our applications, but
 haven’t been able to figure out how to get the Spark SQL Thrift Server to
 use it and connect to C*. That is the missing piece. Once we solve that
 piece of the puzzle then Tableau should be able to see the tables in C*.



 Hi Pawan,

 Tableau + C* is pretty straightforward, especially if you are using
 DSE. Create a new DSN in Tableau using the ODBC driver that comes with
 DSE.
 Once you connect, Tableau allows you to use a C* keyspace as a schema and column
 families as tables.



 Mohammed



 *From:* pawan kumar [mailto:pkv...@gmail.com]
 *Sent:* Friday, April 3, 2015 7:41 AM
 *To:* Todd Nist
 *Cc:* user@spark.apache.org; Mohammed Guller
 *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra



 Hi Todd,

 Thanks for the link. I would be interested in this solution. I am
 using DSE for Cassandra. Would you provide me with info on connecting with
 DSE either through Tableau or Zeppelin? The goal here is to query Cassandra
 through Spark SQL so that I could perform joins and groupBy on my queries.
 Are you able to perform Spark SQL 

Re: WordCount example

2015-04-03 Thread Mohit Anchlia
If I use local[2] instead of URL: spark://ip-10-241-251-232:7077 this
seems to work. I don't understand why, though, because when I
give spark://ip-10-241-251-232:7077 the application seems to bootstrap
successfully; it just doesn't create a socket on the port?


On Fri, Mar 27, 2015 at 10:55 AM, Mohit Anchlia mohitanch...@gmail.com
wrote:

 I checked the ports using netstat and don't see any connections
 established on that port. Logs show only this:

 15/03/27 13:50:48 INFO Master: Registering app NetworkWordCount
 15/03/27 13:50:48 INFO Master: Registered app NetworkWordCount with ID
 app-20150327135048-0002

 Spark ui shows:

 Running Applications
 IDNameCoresMemory per NodeSubmitted TimeUserStateDuration
 app-20150327135048-0002
 http://54.69.225.94:8080/app?appId=app-20150327135048-0002
 NetworkWordCount
 http://ip-10-241-251-232.us-west-2.compute.internal:4040/0512.0 MB2015/03/27
 13:50:48ec2-userWAITING33 s
 Code looks like is being executed:

 java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077

 public static void doWork(String masterUrl) {

 SparkConf conf = new SparkConf().setMaster(masterUrl).setAppName("NetworkWordCount");

 JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

 JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", );

 System.out.println("Successfully created connection");

 mapAndReduce(lines);

 jssc.start(); // Start the computation

 jssc.awaitTermination(); // Wait for the computation to terminate

 }

 public static void main(String ...args) {

 doWork(args[0]);

 }
 And output of the java program after submitting the task:

 java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 15/03/27 13:50:46 INFO SecurityManager: Changing view acls to: ec2-user
 15/03/27 13:50:46 INFO SecurityManager: Changing modify acls to: ec2-user
 15/03/27 13:50:46 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(ec2-user);
 users with modify permissions: Set(ec2-user)
 15/03/27 13:50:46 INFO Slf4jLogger: Slf4jLogger started
 15/03/27 13:50:46 INFO Remoting: Starting remoting
 15/03/27 13:50:47 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkdri...@ip-10-241-251-232.us-west-2.compute.internal
 :60184]
 15/03/27 13:50:47 INFO Utils: Successfully started service 'sparkDriver'
 on port 60184.
 15/03/27 13:50:47 INFO SparkEnv: Registering MapOutputTracker
 15/03/27 13:50:47 INFO SparkEnv: Registering BlockManagerMaster
 15/03/27 13:50:47 INFO DiskBlockManager: Created local directory at
 /tmp/spark-local-20150327135047-5399
 15/03/27 13:50:47 INFO MemoryStore: MemoryStore started with capacity 3.5
 GB
 15/03/27 13:50:47 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 15/03/27 13:50:47 INFO HttpFileServer: HTTP File server directory is
 /tmp/spark-7e26df49-1520-4c77-b411-c837da59fa5b
 15/03/27 13:50:47 INFO HttpServer: Starting HTTP Server
 15/03/27 13:50:47 INFO Utils: Successfully started service 'HTTP file
 server' on port 57955.
 15/03/27 13:50:47 INFO Utils: Successfully started service 'SparkUI' on
 port 4040.
 15/03/27 13:50:47 INFO SparkUI: Started SparkUI at
 http://ip-10-241-251-232.us-west-2.compute.internal:4040
 15/03/27 13:50:47 INFO AppClient$ClientActor: Connecting to master
 spark://ip-10-241-251-232:7077...
 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: Connected to Spark
 cluster with app ID app-20150327135048-0002
 15/03/27 13:50:48 INFO NettyBlockTransferService: Server created on 58358
 15/03/27 13:50:48 INFO BlockManagerMaster: Trying to register BlockManager
 15/03/27 13:50:48 INFO BlockManagerMasterActor: Registering block manager
 ip-10-241-251-232.us-west-2.compute.internal:58358 with 3.5 GB RAM,
 BlockManagerId(driver, ip-10-241-251-232.us-west-2.compute.internal,
 58358)
 15/03/27 13:50:48 INFO BlockManagerMaster: Registered BlockManager
 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: SchedulerBackend is
 ready for scheduling beginning after reached minRegisteredResourcesRatio:
 0.0
 15/03/27 13:50:48 INFO ReceiverTracker: ReceiverTracker started
 15/03/27 13:50:48 INFO ForEachDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO ShuffledDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO MappedDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO FlatMappedDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO SocketInputDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO SocketInputDStream: Slide time = 1000 ms
 15/03/27 13:50:48 INFO SocketInputDStream: Storage level =
 StorageLevel(false, false, false, false, 1)
 15/03/27 13:50:48 INFO SocketInputDStream: Checkpoint interval = null
 15/03/27 13:50:48 INFO SocketInputDStream: Remember duration = 1000 ms
 15/03/27 13:50:48 

Re: Spark + Kinesis

2015-04-03 Thread Vadim Bichutskiy
Thanks. So how do I fix it?

On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan jonat...@amazon.com wrote:

   spark-streaming-kinesis-asl is not part of the Spark distribution on
 your cluster, so you cannot have it be just a provided dependency.  This
 is also why the KCL and its dependencies were not included in the assembly
 (but yes, they should be).


  ~ Jonathan Kelly

   From: Vadim Bichutskiy vadim.bichuts...@gmail.com
 Date: Friday, April 3, 2015 at 12:26 PM
 To: Jonathan Kelly jonat...@amazon.com
 Cc: user@spark.apache.org user@spark.apache.org
 Subject: Re: Spark + Kinesis

   Hi all,

  Good news! I was able to create a Kinesis consumer and assemble it into
 an uber jar following
 http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
 and example
 https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
 .

  However when I try to spark-submit it I get the following exception:

  *Exception in thread main java.lang.NoClassDefFoundError:
 com/amazonaws/auth/AWSCredentialsProvider*

  Do I need to include KCL dependency in *build.sbt*, here's what it looks
 like currently:

  import AssemblyKeys._
 name := Kinesis Consumer
 version := 1.0
 organization := com.myconsumer
 scalaVersion := 2.11.5

  libraryDependencies += org.apache.spark %% spark-core % 1.3.0 %
 provided
 libraryDependencies += org.apache.spark %% spark-streaming % 1.3.0 %
 provided
 libraryDependencies += org.apache.spark %% spark-streaming-kinesis-asl
 % 1.3.0 % provided

  assemblySettings
 jarName in assembly :=  consumer-assembly.jar
 assemblyOption in assembly := (assemblyOption in
 assembly).value.copy(includeScala=false)

  Any help appreciated.

  Thanks,
 Vadim

 On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

  It looks like you're attempting to mix Scala versions, so that's going
 to cause some problems.  If you really want to use Scala 2.11.5, you must
 also use Spark package versions built for Scala 2.11 rather than 2.10.
 Anyway, that's not quite the correct way to specify Scala dependencies in
 build.sbt.  Instead of placing the Scala version after the artifactId (like
 spark-core_2.10), what you actually want is to use just spark-core with
 two percent signs before it.  Using two percent signs will make it use the
 version of Scala that matches your declared scalaVersion.  For example:

  libraryDependencies += org.apache.spark %% spark-core % 1.3.0 %
 provided

  libraryDependencies += org.apache.spark %% spark-streaming %
 1.3.0 % provided

  libraryDependencies += org.apache.spark %%
 spark-streaming-kinesis-asl % 1.3.0

  I think that may get you a little closer, though I think you're
 probably going to run into the same problems I ran into in this thread:
 https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I
 never really got an answer for that, and I temporarily moved on to other
 things for now.


  ~ Jonathan Kelly

   From: 'Vadim Bichutskiy' vadim.bichuts...@gmail.com
 Date: Thursday, April 2, 2015 at 9:53 AM
 To: user@spark.apache.org user@spark.apache.org
 Subject: Spark + Kinesis

   Hi all,

  I am trying to write an Amazon Kinesis consumer Scala app that
 processes data in the
 Kinesis stream. Is this the correct way to specify *build.sbt*:

  ---
 import AssemblyKeys._
 name := "Kinesis Consumer"
 version := "1.0"
 organization := "com.myconsumer"
 scalaVersion := "2.11.5"

 libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
   "org.apache.spark" % "spark-streaming_2.10" % "1.3.0",
   "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")

 assemblySettings
 jarName in assembly := "consumer-assembly.jar"
 assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
 

  In *project/assembly.sbt* I have only the following line:

  addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

  I am using sbt 0.13.7. I adapted Example 7.7 in the Learning Spark book.

  Thanks,
 Vadim





Re: WordCount example

2015-04-03 Thread Tathagata Das
What does the Spark Standalone UI at port 8080 say about number of cores?

On Fri, Apr 3, 2015 at 2:53 PM, Mohit Anchlia mohitanch...@gmail.com
wrote:

 [ec2-user@ip-10-241-251-232 s_lib]$ cat /proc/cpuinfo |grep process
 processor   : 0
 processor   : 1
 processor   : 2
 processor   : 3
 processor   : 4
 processor   : 5
 processor   : 6
 processor   : 7

 On Fri, Apr 3, 2015 at 2:33 PM, Tathagata Das t...@databricks.com wrote:

 How many cores are present in the workers allocated to the standalone
 cluster spark://ip-10-241-251-232:7077 ?


 On Fri, Apr 3, 2015 at 2:18 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

 If I use local[2] instead of *URL:* spark://ip-10-241-251-232:7077 this
 seems to work. I don't understand why though because when I
 give spark://ip-10-241-251-232:7077 application seem to bootstrap
 successfully, just doesn't create a socket on port ?


 On Fri, Mar 27, 2015 at 10:55 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

 I checked the ports using netstat and don't see any connections
 established on that port. Logs show only this:

 15/03/27 13:50:48 INFO Master: Registering app NetworkWordCount
 15/03/27 13:50:48 INFO Master: Registered app NetworkWordCount with ID
 app-20150327135048-0002

 Spark ui shows:

 Running Applications
 IDNameCoresMemory per NodeSubmitted TimeUserStateDuration
 app-20150327135048-0002
 http://54.69.225.94:8080/app?appId=app-20150327135048-0002
 NetworkWordCount
 http://ip-10-241-251-232.us-west-2.compute.internal:4040/0512.0 
 MB2015/03/27
 13:50:48ec2-userWAITING33 s
 Code looks like is being executed:

 java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077

 *public* *static* *void* doWork(String masterUrl){

 SparkConf conf = *new* SparkConf().setMaster(masterUrl).setAppName(
 NetworkWordCount);

 JavaStreamingContext *jssc* = *new* JavaStreamingContext(conf,
 Durations.*seconds*(1));

 JavaReceiverInputDStreamString lines = jssc.socketTextStream(
 localhost, );

 System.*out*.println(Successfully created connection);

 *mapAndReduce*(lines);

  jssc.start(); // Start the computation

 jssc.awaitTermination(); // Wait for the computation to terminate

 }

 *public* *static* *void* main(String ...args){

 *doWork*(args[0]);

 }
 And output of the java program after submitting the task:

 java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 15/03/27 13:50:46 INFO SecurityManager: Changing view acls to: ec2-user
 15/03/27 13:50:46 INFO SecurityManager: Changing modify acls to:
 ec2-user
 15/03/27 13:50:46 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(ec2-user);
 users with modify permissions: Set(ec2-user)
 15/03/27 13:50:46 INFO Slf4jLogger: Slf4jLogger started
 15/03/27 13:50:46 INFO Remoting: Starting remoting
 15/03/27 13:50:47 INFO Remoting: Remoting started; listening on
 addresses
 :[akka.tcp://sparkdri...@ip-10-241-251-232.us-west-2.compute.internal
 :60184]
 15/03/27 13:50:47 INFO Utils: Successfully started service
 'sparkDriver' on port 60184.
 15/03/27 13:50:47 INFO SparkEnv: Registering MapOutputTracker
 15/03/27 13:50:47 INFO SparkEnv: Registering BlockManagerMaster
 15/03/27 13:50:47 INFO DiskBlockManager: Created local directory at
 /tmp/spark-local-20150327135047-5399
 15/03/27 13:50:47 INFO MemoryStore: MemoryStore started with capacity
 3.5 GB
 15/03/27 13:50:47 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 15/03/27 13:50:47 INFO HttpFileServer: HTTP File server directory is
 /tmp/spark-7e26df49-1520-4c77-b411-c837da59fa5b
 15/03/27 13:50:47 INFO HttpServer: Starting HTTP Server
 15/03/27 13:50:47 INFO Utils: Successfully started service 'HTTP file
 server' on port 57955.
 15/03/27 13:50:47 INFO Utils: Successfully started service 'SparkUI' on
 port 4040.
 15/03/27 13:50:47 INFO SparkUI: Started SparkUI at
 http://ip-10-241-251-232.us-west-2.compute.internal:4040
 15/03/27 13:50:47 INFO AppClient$ClientActor: Connecting to master
 spark://ip-10-241-251-232:7077...
 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: Connected to Spark
 cluster with app ID app-20150327135048-0002
 15/03/27 13:50:48 INFO NettyBlockTransferService: Server created on
 58358
 15/03/27 13:50:48 INFO BlockManagerMaster: Trying to register
 BlockManager
 15/03/27 13:50:48 INFO BlockManagerMasterActor: Registering block
 manager ip-10-241-251-232.us-west-2.compute.internal:58358 with 3.5 GB RAM,
 BlockManagerId(driver, ip-10-241-251-232.us-west-2.compute.internal,
 58358)
 15/03/27 13:50:48 INFO BlockManagerMaster: Registered BlockManager
 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: SchedulerBackend is
 ready for scheduling beginning after reached minRegisteredResourcesRatio:
 0.0
 15/03/27 13:50:48 INFO ReceiverTracker: ReceiverTracker started
 

Re: Simple but faster data streaming

2015-04-03 Thread Tathagata Das
I am afraid not. The whole point of Spark Streaming is to make it easy to
do complicated processing on streaming data while interoperating with core
Spark, MLlib, and SQL without the operational overhead of maintaining 4 different
systems. As a slight cost of achieving that unification, there may be some
overhead compared to specialized systems that are designed to do one specific
thing.

If you have to do something simple that could have been done using Flume,
then the resources needed by the Spark Streaming program shouldn't be too
high. Can you provide more details?

TD

On Thu, Apr 2, 2015 at 11:51 AM, Harut Martirosyan 
harut.martiros...@gmail.com wrote:

 Hi guys.

 Is there a more lightweight way of doing stream processing with Spark? What we
 want is a simpler way, preferably with no scheduling, which just streams
 the data to multiple destinations.

 We extensively use Spark Core, SQL, Streaming, and GraphX, so it's our main
 tool and we don't want to add new things to the stack like Storm or Flume. But
 on the other side, it really takes much more resources for the same streaming than
 our previous setup with Flume, especially if we have multiple destinations
 (which triggers multiple actions/scheduling).


 --
 RGRDZ Harut



Re: Spark + Kinesis

2015-04-03 Thread Kelly, Jonathan
Just remove provided from the end of the line where you specify the 
spark-streaming-kinesis-asl dependency.  That will cause that package and all 
of its transitive dependencies (including the KCL, the AWS Java SDK libraries 
and other transitive dependencies) to be included in your uber jar.  They all 
must be in there because they are not part of the Spark distribution in your 
cluster.

However, as I mentioned before, I think making this change might cause you to 
run into the same problems I spoke of in the thread I linked below 
(https://www.mail-archive.com/user@spark.apache.org/msg23891.html), and 
unfortunately I haven't solved that yet.

~ Jonathan Kelly

From: Vadim Bichutskiy 
vadim.bichuts...@gmail.commailto:vadim.bichuts...@gmail.com
Date: Friday, April 3, 2015 at 12:45 PM
To: Jonathan Kelly jonat...@amazon.commailto:jonat...@amazon.com
Cc: user@spark.apache.orgmailto:user@spark.apache.org 
user@spark.apache.orgmailto:user@spark.apache.org
Subject: Re: Spark + Kinesis

Thanks. So how do I fix it?


On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan 
jonat...@amazon.commailto:jonat...@amazon.com wrote:
spark-streaming-kinesis-asl is not part of the Spark distribution on your 
cluster, so you cannot have it be just a provided dependency.  This is also 
why the KCL and its dependencies were not included in the assembly (but yes, 
they should be).

~ Jonathan Kelly

From: Vadim Bichutskiy 
vadim.bichuts...@gmail.commailto:vadim.bichuts...@gmail.com
Date: Friday, April 3, 2015 at 12:26 PM
To: Jonathan Kelly jonat...@amazon.commailto:jonat...@amazon.com
Cc: user@spark.apache.orgmailto:user@spark.apache.org 
user@spark.apache.orgmailto:user@spark.apache.org
Subject: Re: Spark + Kinesis

Hi all,

Good news! I was able to create a Kinesis consumer and assemble it into an 
uber jar following 
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html and 
example 
https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala.

However when I try to spark-submit it I get the following exception:

Exception in thread main java.lang.NoClassDefFoundError: 
com/amazonaws/auth/AWSCredentialsProvider

Do I need to include KCL dependency in build.sbt, here's what it looks like 
currently:

import AssemblyKeys._
name := Kinesis Consumer
version := 1.0
organization := com.myconsumer
scalaVersion := 2.11.5

libraryDependencies += org.apache.spark %% spark-core % 1.3.0 % provided
libraryDependencies += org.apache.spark %% spark-streaming % 1.3.0 % 
provided
libraryDependencies += org.apache.spark %% spark-streaming-kinesis-asl % 
1.3.0 % provided

assemblySettings
jarName in assembly :=  consumer-assembly.jar
assemblyOption in assembly := (assemblyOption in 
assembly).value.copy(includeScala=false)

Any help appreciated.

Thanks,
Vadim

On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan 
jonat...@amazon.commailto:jonat...@amazon.com wrote:
It looks like you're attempting to mix Scala versions, so that's going to cause 
some problems.  If you really want to use Scala 2.11.5, you must also use Spark 
package versions built for Scala 2.11 rather than 2.10.  Anyway, that's not 
quite the correct way to specify Scala dependencies in build.sbt.  Instead of 
placing the Scala version after the artifactId (like spark-core_2.10), what 
you actually want is to use just spark-core with two percent signs before it. 
 Using two percent signs will make it use the version of Scala that matches 
your declared scalaVersion.  For example:

libraryDependencies += org.apache.spark %% spark-core % 1.3.0 % provided

libraryDependencies += org.apache.spark %% spark-streaming % 1.3.0 % 
provided

libraryDependencies += org.apache.spark %% spark-streaming-kinesis-asl % 
1.3.0

I think that may get you a little closer, though I think you're probably going 
to run into the same problems I ran into in this thread: 
https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I never 
really got an answer for that, and I temporarily moved on to other things for 
now.

~ Jonathan Kelly

From: 'Vadim Bichutskiy' 
vadim.bichuts...@gmail.commailto:vadim.bichuts...@gmail.com
Date: Thursday, April 2, 2015 at 9:53 AM
To: user@spark.apache.orgmailto:user@spark.apache.org 
user@spark.apache.orgmailto:user@spark.apache.org
Subject: Spark + Kinesis

Hi all,

I am trying to write an Amazon Kinesis consumer Scala app that processes data 
in the
Kinesis stream. Is this the correct way to specify build.sbt:

---
import AssemblyKeys._
name := Kinesis Consumer
version := 1.0
organization := com.myconsumer
scalaVersion := 2.11.5

libraryDependencies ++= Seq(org.apache.spark % spark-core_2.10 % 1.3.0 % 
provided,
org.apache.spark % spark-streaming_2.10 % 1.3.0
org.apache.spark % spark-streaming-kinesis-asl_2.10 % 1.3.0)


Re: Spark Streaming FileStream Nested File Support

2015-04-03 Thread Adam Ritter
That doesn't seem like a good solution unfortunately as I would be needing
this to work in a production environment.  Do you know why the limitation
exists for FileInputDStream in the first place?  Unless I'm missing
something important about how some of the internals work, I don't see why
this feature couldn't be added at some point.

On Fri, Apr 3, 2015 at 12:47 PM, Tathagata Das t...@databricks.com wrote:

 A sort-of-hacky workaround is to use a queueStream where you can manually
 create RDDs (using sparkContext.hadoopFile) and insert them into the queue. Note
 that this is for testing only, as queueStream does not work with driver
 fault recovery.

 TD
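
 For reference, a minimal sketch of that workaround (using sc.textFile in place
 of hadoopFile for brevity; the batch interval and S3 paths are made up, and
 again this is for testing only):

 import scala.collection.mutable
 import org.apache.spark.rdd.RDD
 import org.apache.spark.streaming.{Seconds, StreamingContext}

 val ssc = new StreamingContext(sc, Seconds(30))
 val rddQueue = new mutable.Queue[RDD[String]]()
 val lines = ssc.queueStream(rddQueue)  // DStream fed from the queue
 lines.count().print()
 ssc.start()
 // ssc.awaitTermination() in a real job

 // Manually create an RDD per nested folder and push it into the queue;
 // with the default oneAtATime behavior, each pushed RDD becomes one batch.
 rddQueue += sc.textFile("s3n://mybucket/2015/03/02/*log")
 rddQueue += sc.textFile("s3n://mybucket/2015/03/03/*log")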

 On Fri, Apr 3, 2015 at 12:23 PM, adamgerst adamge...@gmail.com wrote:

 So after pulling my hair out for a bit trying to convert one of my
 standard
 spark jobs to streaming I found that FileInputDStream does not support
 nested folders (see the brief mention here

 http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
 the fileStream method returns a FileInputDStream).  So before, for my
 standard job, I was reading from say

 s3n://mybucket/2015/03/02/*log

 And could also modify it to simply get an entire months worth of logs.
 Since the logs are split up based upon their date, when the batch ran for
 the day, I simply passed in a parameter of the date to make sure I was
 reading the correct data

 But since I want to turn this job into a streaming job I need to simply do
 something like

 s3n://mybucket/*log

 This would totally work fine if it were a standard spark application, but
 fails for streaming.  Is there anyway I can get around this limitation?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-FileStream-Nested-File-Support-tp22370.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Spark + Kinesis

2015-04-03 Thread Daniil Osipov
Assembly settings have an option to exclude jars. You need something
similar to:
assemblyExcludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
  val excludes = Set(
    "minlog-1.2.jar"
  )
  cp filter { jar => excludes(jar.data.getName) }
}

in your build file (may need to be refactored into a .scala file)
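
An alternative sketch is to resolve just the conflicting minlog classes rather
than excluding the whole jar, via sbt-assembly's merge strategy. This assumes
the older-style mergeStrategy key (and the plugin imports already in your
build) are still accepted by your sbt-assembly version; newer releases call it
assemblyMergeStrategy:

// Keep the first copy of the duplicated com.esotericsoftware.minlog classes
// and fall back to the default strategy for everything else.
mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
  {
    case PathList("com", "esotericsoftware", "minlog", xs @ _*) => MergeStrategy.first
    case x => old(x)
  }
}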

On Fri, Apr 3, 2015 at 12:57 PM, Vadim Bichutskiy 
vadim.bichuts...@gmail.com wrote:

 Remove provided and got the following error:

 [error] (*:assembly) deduplicate: different file contents found in the
 following:

 [error]
 /Users/vb/.ivy2/cache/com.esotericsoftware.kryo/kryo/bundles/kryo-2.21.jar:com/esotericsoftware/minlog/Log$Logger.class

 [error]
 /Users/vb/.ivy2/cache/com.esotericsoftware.minlog/minlog/jars/minlog-1.2.jar:com/esotericsoftware/minlog/Log$Logger.class

 On Fri, Apr 3, 2015 at 3:48 PM, Tathagata Das t...@databricks.com wrote:

 Just remove provided for spark-streaming-kinesis-asl

 libraryDependencies += org.apache.spark %%
 spark-streaming-kinesis-asl % 1.3.0

 On Fri, Apr 3, 2015 at 12:45 PM, Vadim Bichutskiy 
 vadim.bichuts...@gmail.com wrote:

 Thanks. So how do I fix it?

 On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

   spark-streaming-kinesis-asl is not part of the Spark distribution on
 your cluster, so you cannot have it be just a provided dependency.  This
 is also why the KCL and its dependencies were not included in the assembly
 (but yes, they should be).


  ~ Jonathan Kelly

   From: Vadim Bichutskiy vadim.bichuts...@gmail.com
 Date: Friday, April 3, 2015 at 12:26 PM
 To: Jonathan Kelly jonat...@amazon.com
 Cc: user@spark.apache.org user@spark.apache.org
 Subject: Re: Spark + Kinesis

   Hi all,

  Good news! I was able to create a Kinesis consumer and assemble it
 into an uber jar following
 http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
 and example
 https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
 .

  However when I try to spark-submit it I get the following exception:

  *Exception in thread main java.lang.NoClassDefFoundError:
 com/amazonaws/auth/AWSCredentialsProvider*

  Do I need to include KCL dependency in *build.sbt*, here's what it
 looks like currently:

  import AssemblyKeys._
 name := Kinesis Consumer
 version := 1.0
 organization := com.myconsumer
 scalaVersion := 2.11.5

  libraryDependencies += org.apache.spark %% spark-core % 1.3.0 %
 provided
 libraryDependencies += org.apache.spark %% spark-streaming %
 1.3.0 % provided
 libraryDependencies += org.apache.spark %%
 spark-streaming-kinesis-asl % 1.3.0 % provided

  assemblySettings
 jarName in assembly :=  consumer-assembly.jar
 assemblyOption in assembly := (assemblyOption in
 assembly).value.copy(includeScala=false)

  Any help appreciated.

  Thanks,
 Vadim

 On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

  It looks like you're attempting to mix Scala versions, so that's
 going to cause some problems.  If you really want to use Scala 2.11.5, you
 must also use Spark package versions built for Scala 2.11 rather than
 2.10.  Anyway, that's not quite the correct way to specify Scala
 dependencies in build.sbt.  Instead of placing the Scala version after the
 artifactId (like spark-core_2.10), what you actually want is to use just
 spark-core with two percent signs before it.  Using two percent signs
 will make it use the version of Scala that matches your declared
 scalaVersion.  For example:

  libraryDependencies += org.apache.spark %% spark-core % 1.3.0
 % provided

  libraryDependencies += org.apache.spark %% spark-streaming %
 1.3.0 % provided

  libraryDependencies += org.apache.spark %%
 spark-streaming-kinesis-asl % 1.3.0

  I think that may get you a little closer, though I think you're
 probably going to run into the same problems I ran into in this thread:
 https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I
 never really got an answer for that, and I temporarily moved on to other
 things for now.


  ~ Jonathan Kelly

   From: 'Vadim Bichutskiy' vadim.bichuts...@gmail.com
 Date: Thursday, April 2, 2015 at 9:53 AM
 To: user@spark.apache.org user@spark.apache.org
 Subject: Spark + Kinesis

   Hi all,

  I am trying to write an Amazon Kinesis consumer Scala app that
 processes data in the
 Kinesis stream. Is this the correct way to specify *build.sbt*:

  ---
 import AssemblyKeys._
 name := "Kinesis Consumer"
 version := "1.0"
 organization := "com.myconsumer"
 scalaVersion := "2.11.5"

 libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
   "org.apache.spark" % "spark-streaming_2.10" % "1.3.0",
   "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")

 assemblySettings
 jarName in assembly := "consumer-assembly.jar"
 assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
 

  In 

spark mesos deployment : starting workers based on attributes

2015-04-03 Thread Ankur Chauhan

Hi,

I am trying to figure out if there is a way to tell the mesos
scheduler in spark to isolate the workers to a set of mesos slaves
that have a given attribute such as `tachyon:true`.

Anyone know if that is possible or how I could achieve such a behavior?

Thanks!
-- Ankur Chauhan

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Regarding MLLIB sparse and dense matrix

2015-04-03 Thread Jeetendra Gangele
Hi All,
I am building a logistic regression model for matching person data. Let's say
two person objects are given with their attributes and we need to find the score;
that means on one side you have 10 million records and on the other side we have 1
record, and we need to tell which one matches with the highest score among the 1 million.

I am storing the scores of the similarity algos in a dense matrix and considering
these as features. I will apply many similarity algos on each attribute.

Should I use sparse or dense? What happens with dense when a score is null or
when some of the attributes are missing?

Is there any support for regularized logistic regression? Currently I am
using LogisticRegressionWithSGD.

Regards
jeetendra
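
On the last question: a rough sketch, assuming the MLlib 1.x RDD-based API, of
adding regularization to LogisticRegressionWithSGD through its optimizer (the
feature vectors below are made-up similarity scores):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.SquaredL2Updater
import org.apache.spark.mllib.regression.LabeledPoint

val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.9, 0.8, 0.7)),     // all similarity scores present
  LabeledPoint(0.0, Vectors.sparse(3, Seq((0, 0.1))))  // missing scores simply omitted in a sparse vector
))

val lr = new LogisticRegressionWithSGD()
lr.optimizer
  .setNumIterations(100)
  .setRegParam(0.1)                  // regularization strength
  .setUpdater(new SquaredL2Updater)  // L2 penalty; use L1Updater for L1
val model = lr.run(training)

LogisticRegressionWithLBFGS exposes similar optimizer settings and is generally
a better starting point than plain SGD.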


Re: variant record by case classes in shell fails?

2015-04-03 Thread Michael Albert
My apologies for following up on my own post, but a friend just pointed out that if I
use kryo with reference counting AND copy-and-paste, this runs.
However, if I try to load the file, this fails as described below.
I thought load was supposed to be equivalent?
Thanks!
-Mike

  From: Michael Albert m_albert...@yahoo.com.INVALID
 To: User user@spark.apache.org 
 Sent: Friday, April 3, 2015 2:45 PM
 Subject: variant record by case classes in shell fails?
   
Greetings!
For me, the code below fails from the shell. However, I can do essentially the
same from compiled code, exporting the jar.
If I use default serialization or kryo with reference tracking, the error
message tells me it can't find the constructor for A. If I use kryo without
reference tracking, I get a stack overflow.
I'm using Spark 1.2.1 on AWS EMR (hadoop 2.4).
I've also tried putting this code inside an object.
Is this just me?  Am I overlooking something obvious?
Thanks!
-Mike
:paste
sealed class AorB
case class A(i: Int) extends AorB
case class B(i: Int, j: Int) extends AorB

sc.parallelize(0.until(1)).map{ _ =>
    val x = A(1)
    x
}.collect()



  

Re: Spark Streaming FileStream Nested File Support

2015-04-03 Thread Tathagata Das
Yes, definitely can be added. Just haven't gotten around to doing it :)
There are proposals for this that you can try -
https://github.com/apache/spark/pull/2765/files . Have you reviewed it at
some point?

On Fri, Apr 3, 2015 at 1:08 PM, Adam Ritter adamge...@gmail.com wrote:

 That doesn't seem like a good solution unfortunately as I would be needing
 this to work in a production environment.  Do you know why the limitation
 exists for FileInputDStream in the first place?  Unless I'm missing
  something important about how some of the internals work, I don't see why
  this feature couldn't be added at some point.

 On Fri, Apr 3, 2015 at 12:47 PM, Tathagata Das t...@databricks.com
 wrote:

  A sort-of-hacky workaround is to use a queueStream where you can manually
  create RDDs (using sparkContext.hadoopFile) and insert them into the queue. Note
  that this is for testing only, as queueStream does not work with driver
  fault recovery.

 TD

 On Fri, Apr 3, 2015 at 12:23 PM, adamgerst adamge...@gmail.com wrote:

 So after pulling my hair out for a bit trying to convert one of my
 standard
 spark jobs to streaming I found that FileInputDStream does not support
 nested folders (see the brief mention here

 http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
 the fileStream method returns a FileInputDStream).  So before, for my
 standard job, I was reading from say

 s3n://mybucket/2015/03/02/*log

 And could also modify it to simply get an entire months worth of logs.
 Since the logs are split up based upon their date, when the batch ran for
 the day, I simply passed in a parameter of the date to make sure I was
 reading the correct data

 But since I want to turn this job into a streaming job I need to simply
 do
 something like

 s3n://mybucket/*log

 This would totally work fine if it were a standard spark application, but
 fails for streaming.  Is there anyway I can get around this limitation?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-FileStream-Nested-File-Support-tp22370.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Spark Streaming FileStream Nested File Support

2015-04-03 Thread adamgerst
So after pulling my hair out for a bit trying to convert one of my standard
spark jobs to streaming I found that FileInputDStream does not support
nested folders (see the brief mention here
http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
the fileStream method returns a FileInputDStream).  So before, for my
standard job, I was reading from say

s3n://mybucket/2015/03/02/*log

And could also modify it to simply get an entire month's worth of logs.
Since the logs are split up based upon their date, when the batch ran for
the day, I simply passed in a parameter of the date to make sure I was
reading the correct data

But since I want to turn this job into a streaming job I need to simply do
something like

s3n://mybucket/*log

This would totally work fine if it were a standard spark application, but
fails for streaming.  Is there any way I can get around this limitation?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-FileStream-Nested-File-Support-tp22370.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: WordCount example

2015-04-03 Thread Mohit Anchlia
[ec2-user@ip-10-241-251-232 s_lib]$ cat /proc/cpuinfo |grep process
processor   : 0
processor   : 1
processor   : 2
processor   : 3
processor   : 4
processor   : 5
processor   : 6
processor   : 7

On Fri, Apr 3, 2015 at 2:33 PM, Tathagata Das t...@databricks.com wrote:

 How many cores are present in the workers allocated to the standalone
 cluster spark://ip-10-241-251-232:7077 ?


 On Fri, Apr 3, 2015 at 2:18 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

 If I use local[2] instead of *URL:* spark://ip-10-241-251-232:7077 this
 seems to work. I don't understand why though because when I
 give spark://ip-10-241-251-232:7077 application seem to bootstrap
 successfully, just doesn't create a socket on port ?


 On Fri, Mar 27, 2015 at 10:55 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

 I checked the ports using netstat and don't see any connections
 established on that port. Logs show only this:

 15/03/27 13:50:48 INFO Master: Registering app NetworkWordCount
 15/03/27 13:50:48 INFO Master: Registered app NetworkWordCount with ID
 app-20150327135048-0002

 Spark ui shows:

 Running Applications
 ID: app-20150327135048-0002
 (http://54.69.225.94:8080/app?appId=app-20150327135048-0002)
 Name: NetworkWordCount
 (http://ip-10-241-251-232.us-west-2.compute.internal:4040/)
 Cores: 0 | Memory per Node: 512.0 MB
 Submitted Time: 2015/03/27 13:50:48 | User: ec2-user | State: WAITING | Duration: 33 s
 Code looks like is being executed:

 java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077

 public static void doWork(String masterUrl) {
   SparkConf conf = new SparkConf().setMaster(masterUrl).setAppName("NetworkWordCount");
   JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
   JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", );
   System.out.println("Successfully created connection");
   mapAndReduce(lines);
   jssc.start();            // Start the computation
   jssc.awaitTermination(); // Wait for the computation to terminate
 }

 public static void main(String... args) {
   doWork(args[0]);
 }
 And output of the java program after submitting the task:

 java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 15/03/27 13:50:46 INFO SecurityManager: Changing view acls to: ec2-user
 15/03/27 13:50:46 INFO SecurityManager: Changing modify acls to: ec2-user
 15/03/27 13:50:46 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(ec2-user);
 users with modify permissions: Set(ec2-user)
 15/03/27 13:50:46 INFO Slf4jLogger: Slf4jLogger started
 15/03/27 13:50:46 INFO Remoting: Starting remoting
 15/03/27 13:50:47 INFO Remoting: Remoting started; listening on
 addresses
 :[akka.tcp://sparkdri...@ip-10-241-251-232.us-west-2.compute.internal
 :60184]
 15/03/27 13:50:47 INFO Utils: Successfully started service 'sparkDriver'
 on port 60184.
 15/03/27 13:50:47 INFO SparkEnv: Registering MapOutputTracker
 15/03/27 13:50:47 INFO SparkEnv: Registering BlockManagerMaster
 15/03/27 13:50:47 INFO DiskBlockManager: Created local directory at
 /tmp/spark-local-20150327135047-5399
 15/03/27 13:50:47 INFO MemoryStore: MemoryStore started with capacity
 3.5 GB
 15/03/27 13:50:47 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 15/03/27 13:50:47 INFO HttpFileServer: HTTP File server directory is
 /tmp/spark-7e26df49-1520-4c77-b411-c837da59fa5b
 15/03/27 13:50:47 INFO HttpServer: Starting HTTP Server
 15/03/27 13:50:47 INFO Utils: Successfully started service 'HTTP file
 server' on port 57955.
 15/03/27 13:50:47 INFO Utils: Successfully started service 'SparkUI' on
 port 4040.
 15/03/27 13:50:47 INFO SparkUI: Started SparkUI at
 http://ip-10-241-251-232.us-west-2.compute.internal:4040
 15/03/27 13:50:47 INFO AppClient$ClientActor: Connecting to master
 spark://ip-10-241-251-232:7077...
 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: Connected to Spark
 cluster with app ID app-20150327135048-0002
 15/03/27 13:50:48 INFO NettyBlockTransferService: Server created on 58358
 15/03/27 13:50:48 INFO BlockManagerMaster: Trying to register
 BlockManager
 15/03/27 13:50:48 INFO BlockManagerMasterActor: Registering block
 manager ip-10-241-251-232.us-west-2.compute.internal:58358 with 3.5 GB RAM,
 BlockManagerId(driver, ip-10-241-251-232.us-west-2.compute.internal,
 58358)
 15/03/27 13:50:48 INFO BlockManagerMaster: Registered BlockManager
 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: SchedulerBackend is
 ready for scheduling beginning after reached minRegisteredResourcesRatio:
 0.0
 15/03/27 13:50:48 INFO ReceiverTracker: ReceiverTracker started
 15/03/27 13:50:48 INFO ForEachDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO ShuffledDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO 

Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread pawan kumar
Hi Todd,

Thanks for the help. I was able to get DSE working with Tableau as
per the link provided by Mohammed. Now I am trying to figure out if I can
write Spark SQL queries from Tableau and get data from DSE. My end goal is
to get a web-based tool where I can write SQL queries that pull
data from Cassandra.

With Zeppelin I was able to build and run it in EC2, but I am not sure if the
configuration is right. I am pointing to a Spark master which is a remote
DSE node, and all the Spark and Spark SQL dependencies are on that remote node. I
am not sure if I need to install Spark and its dependencies on the web UI
(Zeppelin) node.

I am not sure whether talking about Zeppelin in this thread is appropriate.

Thanks once again for all the help.

Thanks,
Pawan Venugopal


On Fri, Apr 3, 2015 at 11:48 AM, Todd Nist tsind...@gmail.com wrote:

 @Pawan

 Not sure if you have seen this or not, but here is a good example by
 Jonathan Lacefield of Datastax's on hooking up sparksql with DSE, adding
 Tableau is as simple as Mohammed stated with DSE.
 https://github.com/jlacefie/sparksqltest.

 HTH,
 Todd

 On Fri, Apr 3, 2015 at 2:39 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Mohammed,

 Not sure if you have tried this or not.  You could try using the below
 api to start the thriftserver with an existing context.


 https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42

 The one thing that Michael Armbrust @ Databricks recommended was this:

 You can start a JDBC server with an existing context.  See my answer
 here:
 http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html

 So something like this based on example from Cheng Lian:

 *Server*

 import  org.apache.spark.sql.hive.HiveContext
 import  org.apache.spark.sql.catalyst.types._

 val  sparkContext  =  sc
 import  sparkContext._
 val  sqlContext  =  new  HiveContext(sparkContext)
 import  sqlContext._
 makeRDD((1, "hello") :: (2, "world") :: Nil).toSchemaRDD.cache().registerTempTable("t")
 // replace the above with the C* + spark-cassandra-connector to generate a
 // SchemaRDD and registerTempTable

 import  org.apache.spark.sql.hive.thriftserver._
 HiveThriftServer2.startWithContext(sqlContext)

 Then Startup

 ./bin/beeline -u jdbc:hive2://localhost:1/default
 0: jdbc:hive2://localhost:1/default select * from t;


 I have not tried this yet from Tableau.   My understanding is that the
 tempTable is only valid as long as the sqlContext is, so if one terminates
 the code representing the *Server*, and then restarts the standard
 thrift server, sbin/start-thriftserver ..., the table won't be available.

 Another possibility is to perhaps use the tuplejump cash project,
 https://github.com/tuplejump/cash.

 HTH.

 -Todd
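
 Purely to illustrate the placeholder comment in the snippet above, a sketch assuming
 spark-cassandra-connector 1.2.x, a made-up keyspace ks with a users(id int, name text)
 table, and spark.cassandra.connection.host already set on the conf:

   import com.datastax.spark.connector._
   import org.apache.spark.sql.hive.HiveContext
   import org.apache.spark.sql.hive.thriftserver._

   // Hypothetical C* table: ks.users(id int, name text)
   case class User(id: Int, name: String)

   val sqlContext = new HiveContext(sc)
   import sqlContext.implicits._

   // Pull the Cassandra table through the connector and expose it as a temp table.
   val users = sc.cassandraTable[User]("ks", "users")
   users.toDF().cache().registerTempTable("users")

   // Serve this context over JDBC so beeline/Tableau can see the "users" table.
   HiveThriftServer2.startWithContext(sqlContext)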

 On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar pkv...@gmail.com wrote:

 Thanks mohammed. Will give it a try today. We would also need the
 sparksSQL piece as we are migrating our data store from oracle to C* and it
 would be easier to maintain all the reports rather recreating each one from
 scratch.

 Thanks,
 Pawan Venugopal.
 On Apr 3, 2015 7:59 AM, Mohammed Guller moham...@glassbeam.com
 wrote:

  Hi Todd,



 We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly
 with C* using the ODBC driver, but now would like to add Spark SQL to the
 mix. I haven’t been able to find any documentation for how to make this
 combination work.



 We are using the Spark-Cassandra-Connector in our applications, but
 haven’t been able to figure out how to get the Spark SQL Thrift Server to
 use it and connect to C*. That is the missing piece. Once we solve that
 piece of the puzzle then Tableau should be able to see the tables in C*.



 Hi Pawan,

 Tableau + C* is pretty straight forward, especially if you are using
 DSE. Create a new DSN in Tableau using the ODBC driver that comes with DSE.
 Once you connect, Tableau allows to use C* keyspace as schema and column
 families as tables.



 Mohammed



 *From:* pawan kumar [mailto:pkv...@gmail.com]
 *Sent:* Friday, April 3, 2015 7:41 AM
 *To:* Todd Nist
 *Cc:* user@spark.apache.org; Mohammed Guller
 *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra



 Hi Todd,

 Thanks for the link. I would be interested in this solution. I am using
 DSE for cassandra. Would you provide me with info on connecting with DSE
 either through Tableau or zeppelin. The goal here is query cassandra
 through spark sql so that I could perform joins and groupby on my queries.
 Are you able to perform spark sql queries with tableau?

 Thanks,
 Pawan Venugopal

 On Apr 3, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote:

 What version of Cassandra are you using?  Are you using DSE or the
 stock Apache Cassandra version?  I have connected it with DSE, but have not
 attempted it with the standard Apache Cassandra version.



 FWIW,
 

Re: MLlib: save models to HDFS?

2015-04-03 Thread Xiangrui Meng
In 1.3, you can use model.save(sc, hdfs path). You can check the
code examples here:
http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#examples.
-Xiangrui
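
For example, a minimal sketch with ALS (the toy ratings and HDFS path are made up):

  import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

  // Assumed input: an RDD[Rating]; here just a few hard-coded (user, product, rating) triples.
  val ratings = sc.parallelize(Seq(Rating(1, 10, 4.0), Rating(2, 10, 3.0), Rating(1, 20, 5.0)))

  val model = ALS.train(ratings, 10 /* rank */, 10 /* iterations */, 0.01 /* lambda */)

  // Persist the factor matrices and metadata to HDFS (path is an assumption).
  model.save(sc, "hdfs:///models/als-20150403")

  // Later, possibly in a different application:
  val reloaded = MatrixFactorizationModel.load(sc, "hdfs:///models/als-20150403")
  val score = reloaded.predict(1, 20)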

On Fri, Apr 3, 2015 at 2:17 PM, Justin Yip yipjus...@prediction.io wrote:
 Hello Zhou,

 You can look at the recommendation template of PredictionIO. PredictionIO is
 built on top of Spark, and this template illustrates how you can save
 the ALS model to HDFS and then reload it later.

 Justin


 On Fri, Apr 3, 2015 at 9:16 AM, S. Zhou myx...@yahoo.com.invalid wrote:

 I am new to MLib so I have a basic question: is it possible to save MLlib
 models (particularly CF models) to HDFS and then reload it later? If yes,
 could u share some sample code (I could not find it in MLlib tutorial).
 Thanks!



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: WordCount example

2015-04-03 Thread Tathagata Das
How many cores are present in the workers allocated to the standalone cluster
spark://ip-10-241-251-232:7077 ?


On Fri, Apr 3, 2015 at 2:18 PM, Mohit Anchlia mohitanch...@gmail.com
wrote:

 If I use local[2] instead of *URL:* spark://ip-10-241-251-232:7077 this
 seems to work. I don't understand why though because when I
 give spark://ip-10-241-251-232:7077 application seem to bootstrap
 successfully, just doesn't create a socket on port ?


 On Fri, Mar 27, 2015 at 10:55 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

 I checked the ports using netstat and don't see any connections
 established on that port. Logs show only this:

 15/03/27 13:50:48 INFO Master: Registering app NetworkWordCount
 15/03/27 13:50:48 INFO Master: Registered app NetworkWordCount with ID
 app-20150327135048-0002

 Spark ui shows:

 Running Applications
 ID: app-20150327135048-0002
 (http://54.69.225.94:8080/app?appId=app-20150327135048-0002)
 Name: NetworkWordCount
 (http://ip-10-241-251-232.us-west-2.compute.internal:4040/)
 Cores: 0 | Memory per Node: 512.0 MB
 Submitted Time: 2015/03/27 13:50:48 | User: ec2-user | State: WAITING | Duration: 33 s
 Code looks like is being executed:

 java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077

 public static void doWork(String masterUrl) {
   SparkConf conf = new SparkConf().setMaster(masterUrl).setAppName("NetworkWordCount");
   JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
   JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", );
   System.out.println("Successfully created connection");
   mapAndReduce(lines);
   jssc.start();            // Start the computation
   jssc.awaitTermination(); // Wait for the computation to terminate
 }

 public static void main(String... args) {
   doWork(args[0]);
 }
 And output of the java program after submitting the task:

 java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 15/03/27 13:50:46 INFO SecurityManager: Changing view acls to: ec2-user
 15/03/27 13:50:46 INFO SecurityManager: Changing modify acls to: ec2-user
 15/03/27 13:50:46 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(ec2-user);
 users with modify permissions: Set(ec2-user)
 15/03/27 13:50:46 INFO Slf4jLogger: Slf4jLogger started
 15/03/27 13:50:46 INFO Remoting: Starting remoting
 15/03/27 13:50:47 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkdri...@ip-10-241-251-232.us-west-2.compute.internal
 :60184]
 15/03/27 13:50:47 INFO Utils: Successfully started service 'sparkDriver'
 on port 60184.
 15/03/27 13:50:47 INFO SparkEnv: Registering MapOutputTracker
 15/03/27 13:50:47 INFO SparkEnv: Registering BlockManagerMaster
 15/03/27 13:50:47 INFO DiskBlockManager: Created local directory at
 /tmp/spark-local-20150327135047-5399
 15/03/27 13:50:47 INFO MemoryStore: MemoryStore started with capacity 3.5
 GB
 15/03/27 13:50:47 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 15/03/27 13:50:47 INFO HttpFileServer: HTTP File server directory is
 /tmp/spark-7e26df49-1520-4c77-b411-c837da59fa5b
 15/03/27 13:50:47 INFO HttpServer: Starting HTTP Server
 15/03/27 13:50:47 INFO Utils: Successfully started service 'HTTP file
 server' on port 57955.
 15/03/27 13:50:47 INFO Utils: Successfully started service 'SparkUI' on
 port 4040.
 15/03/27 13:50:47 INFO SparkUI: Started SparkUI at
 http://ip-10-241-251-232.us-west-2.compute.internal:4040
 15/03/27 13:50:47 INFO AppClient$ClientActor: Connecting to master
 spark://ip-10-241-251-232:7077...
 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: Connected to Spark
 cluster with app ID app-20150327135048-0002
 15/03/27 13:50:48 INFO NettyBlockTransferService: Server created on 58358
 15/03/27 13:50:48 INFO BlockManagerMaster: Trying to register BlockManager
 15/03/27 13:50:48 INFO BlockManagerMasterActor: Registering block manager
 ip-10-241-251-232.us-west-2.compute.internal:58358 with 3.5 GB RAM,
 BlockManagerId(driver, ip-10-241-251-232.us-west-2.compute.internal,
 58358)
 15/03/27 13:50:48 INFO BlockManagerMaster: Registered BlockManager
 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: SchedulerBackend is
 ready for scheduling beginning after reached minRegisteredResourcesRatio:
 0.0
 15/03/27 13:50:48 INFO ReceiverTracker: ReceiverTracker started
 15/03/27 13:50:48 INFO ForEachDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO ShuffledDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO MappedDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO FlatMappedDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO SocketInputDStream: metadataCleanupDelay = -1
 15/03/27 13:50:48 INFO SocketInputDStream: Slide time = 1000 ms
 15/03/27 13:50:48 INFO SocketInputDStream: Storage level =
 

RE: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Mohammed Guller
Thanks, Todd.

It is an interesting idea; worth trying.

I think the cash project is old. The tuplejump guy has created another project 
called CalliopeServer2, which works like a charm with BI tools that use JDBC, 
but unfortunately Tableau throws an error when it connects to it.

Mohammed

From: Todd Nist [mailto:tsind...@gmail.com]
Sent: Friday, April 3, 2015 11:39 AM
To: pawan kumar
Cc: Mohammed Guller; user@spark.apache.org
Subject: Re: Tableau + Spark SQL Thrift Server + Cassandra

Hi Mohammed,

Not sure if you have tried this or not.  You could try using the below api to 
start the thriftserver with an existing context.

https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42

The one thing that Michael Armbrust @ Databricks recommended was this:
You can start a JDBC server with an existing context.  See my answer here: 
http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html

So something like this based on example from Cheng Lian:

Server

import  org.apache.spark.sql.hive.HiveContext

import  org.apache.spark.sql.catalyst.types._



val  sparkContext  =  sc

import  sparkContext._

val  sqlContext  =  new  HiveContext(sparkContext)

import  sqlContext._

makeRDD((1, "hello") :: (2, "world") :: Nil).toSchemaRDD.cache().registerTempTable("t")

// replace the above with the C* + spark-cassandra-connector to generate a
// SchemaRDD and registerTempTable



import  org.apache.spark.sql.hive.thriftserver._

HiveThriftServer2.startWithContext(sqlContext)
Then Startup

./bin/beeline -u jdbc:hive2://localhost:1/default

0: jdbc:hive2://localhost:1/default select * from t;


I have not tried this yet from Tableau.   My understanding is that the 
tempTable is only valid as long as the sqlContext is, so if one terminates the 
code representing the Server, and then restarts the standard thrift server, 
sbin/start-thriftserver ..., the table won't be available.

Another possibility is to perhaps use the tuplejump cash project, 
https://github.com/tuplejump/cash.

HTH.

-Todd

On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar 
pkv...@gmail.commailto:pkv...@gmail.com wrote:

Thanks mohammed. Will give it a try today. We would also need the sparksSQL 
piece as we are migrating our data store from oracle to C* and it would be 
easier to maintain all the reports rather recreating each one from scratch.

Thanks,
Pawan Venugopal.
On Apr 3, 2015 7:59 AM, Mohammed Guller 
moham...@glassbeam.commailto:moham...@glassbeam.com wrote:
Hi Todd,

We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly with C* 
using the ODBC driver, but now would like to add Spark SQL to the mix. I 
haven’t been able to find any documentation for how to make this combination 
work.

We are using the Spark-Cassandra-Connector in our applications, but haven’t 
been able to figure out how to get the Spark SQL Thrift Server to use it and 
connect to C*. That is the missing piece. Once we solve that piece of the 
puzzle then Tableau should be able to see the tables in C*.

Hi Pawan,
Tableau + C* is pretty straight forward, especially if you are using DSE. 
Create a new DSN in Tableau using the ODBC driver that comes with DSE. Once you 
connect, Tableau allows to use C* keyspace as schema and column families as 
tables.

Mohammed

From: pawan kumar [mailto:pkv...@gmail.commailto:pkv...@gmail.com]
Sent: Friday, April 3, 2015 7:41 AM
To: Todd Nist
Cc: user@spark.apache.orgmailto:user@spark.apache.org; Mohammed Guller
Subject: Re: Tableau + Spark SQL Thrift Server + Cassandra


Hi Todd,

Thanks for the link. I would be interested in this solution. I am using DSE for 
cassandra. Would you provide me with info on connecting with DSE either through 
Tableau or zeppelin. The goal here is query cassandra through spark sql so that 
I could perform joins and groupby on my queries. Are you able to perform spark 
sql queries with tableau?

Thanks,
Pawan Venugopal
On Apr 3, 2015 5:03 AM, Todd Nist 
tsind...@gmail.commailto:tsind...@gmail.com wrote:
What version of Cassandra are you using?  Are you using DSE or the stock Apache 
Cassandra version?  I have connected it with DSE, but have not attempted it 
with the standard Apache Cassandra version.

FWIW, 
http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
 provides an ODBC driver tor accessing C* from Tableau.  Granted it does not 
provide all the goodness of Spark.  Are you attempting to leverage the 
spark-cassandra-connector for this?



On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller 
moham...@glassbeam.commailto:moham...@glassbeam.com wrote:
Hi –

Is anybody using Tableau to analyze data in Cassandra through the Spark SQL 
Thrift Server?

Thanks!

Mohammed





Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread pawan kumar
@Todd,

I had looked at it yesterday. All the dependencies it describes are already added on
the DSE node. Do I need to include the Spark and DSE dependencies on the
Zeppelin node as well?

I built Zeppelin with no Spark and no Hadoop. My understanding is that Zeppelin
will send a request to a remote master at spark://
ec2-54-163-181-25.compute-1.amazonaws.com:7077, which is a DSE box. This
box has all the dependencies. Do I need to install all the dependencies on the
Zeppelin box?

Thanks,
Pawan Venugopal

On Fri, Apr 3, 2015 at 12:08 PM, Todd Nist tsind...@gmail.com wrote:

 @Pawan,

 So it's been a couple of months since I have had a chance to do anything
 with Zeppelin, but here is a link to a post on what I did to get it working
 https://groups.google.com/forum/#!topic/zeppelin-developers/mCNdyOXNikI.
 This may or may not work with the newer releases from Zeppelin.

 -Todd

 On Fri, Apr 3, 2015 at 3:02 PM, pawan kumar pkv...@gmail.com wrote:

 Hi Todd,

 Thanks for the help. So i was able to get the DSE working with tableau as
 per the link provided by Mohammed. Now i trying to figure out if i could
 write sparksql queries from tableau and get data from DSE. My end goal is
 to get a web based tool where i could write sql queries which will pull
 data from cassandra.

 With Zeppelin I was able to build and run it in EC2 but not sure if
 configurations are right. I am pointing to a spark master which is a remote
 DSE node and all spark and sparksql dependencies are in the remote node. I
 am not sure if i need to install spark and its dependencies in the webui
 (zepplene) node.

 I am not sure talking about zepplelin in this thread is right.

 Thanks once again for all the help.

 Thanks,
 Pawan Venugopal


 On Fri, Apr 3, 2015 at 11:48 AM, Todd Nist tsind...@gmail.com wrote:

 @Pawan

 Not sure if you have seen this or not, but here is a good example by
 Jonathan Lacefield of Datastax's on hooking up sparksql with DSE, adding
 Tableau is as simple as Mohammed stated with DSE.
 https://github.com/jlacefie/sparksqltest.

 HTH,
 Todd

 On Fri, Apr 3, 2015 at 2:39 PM, Todd Nist tsind...@gmail.com wrote:

 Hi Mohammed,

 Not sure if you have tried this or not.  You could try using the below
 api to start the thriftserver with an existing context.


 https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42

 The one thing that Michael Armbrust @ Databricks recommended was this:

 You can start a JDBC server with an existing context.  See my answer
 here:
 http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html

 So something like this based on example from Cheng Lian:

 *Server*

 import  org.apache.spark.sql.hive.HiveContext
 import  org.apache.spark.sql.catalyst.types._

 val  sparkContext  =  sc
 import  sparkContext._
 val  sqlContext  =  new  HiveContext(sparkContext)
 import  sqlContext._
 makeRDD((1, "hello") :: (2, "world") :: Nil).toSchemaRDD.cache().registerTempTable("t")
 // replace the above with the C* + spark-cassandra-connector to generate a
 // SchemaRDD and registerTempTable

 import  org.apache.spark.sql.hive.thriftserver._
 HiveThriftServer2.startWithContext(sqlContext)

 Then Startup

 ./bin/beeline -u jdbc:hive2://localhost:1/default
 0: jdbc:hive2://localhost:1/default select * from t;


 I have not tried this yet from Tableau.   My understanding is that the
 tempTable is only valid as long as the sqlContext is, so if one terminates
 the code representing the *Server*, and then restarts the standard
 thrift server, sbin/start-thriftserver ..., the table won't be available.

 Another possibility is to perhaps use the tuplejump cash project,
 https://github.com/tuplejump/cash.

 HTH.

 -Todd

 On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar pkv...@gmail.com wrote:

 Thanks mohammed. Will give it a try today. We would also need the
 sparksSQL piece as we are migrating our data store from oracle to C* and 
 it
 would be easier to maintain all the reports rather recreating each one 
 from
 scratch.

 Thanks,
 Pawan Venugopal.
 On Apr 3, 2015 7:59 AM, Mohammed Guller moham...@glassbeam.com
 wrote:

  Hi Todd,



 We are using Apache C* 2.1.3, not DSE. We got Tableau to work
 directly with C* using the ODBC driver, but now would like to add Spark 
 SQL
 to the mix. I haven’t been able to find any documentation for how to make
 this combination work.



 We are using the Spark-Cassandra-Connector in our applications, but
 haven’t been able to figure out how to get the Spark SQL Thrift Server to
 use it and connect to C*. That is the missing piece. Once we solve that
 piece of the puzzle then Tableau should be able to see the tables in C*.



 Hi Pawan,

 Tableau + C* is pretty straight forward, especially if you are using
 DSE. Create a new DSN in Tableau using the ODBC driver that comes with 
 DSE.
 Once you connect, Tableau allows to use C* keyspace as schema and column
 families as tables.



 

Re: Spark + Kinesis

2015-04-03 Thread Tathagata Das
Just remove "provided" for spark-streaming-kinesis-asl:

libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"
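
In context, the full dependency block would then look roughly like this (a sketch combining
this with Jonathan's earlier %% advice; the Scala version shown is an assumption and must
match the Spark artifacts you use):

  scalaVersion := "2.10.4"  // assumption: 2.10 artifacts; use 2.11.x only with 2.11 Spark builds

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"                  % "1.3.0" % "provided",
    "org.apache.spark" %% "spark-streaming"             % "1.3.0" % "provided",
    "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"   // not provided: must land in the assembly
  )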

On Fri, Apr 3, 2015 at 12:45 PM, Vadim Bichutskiy 
vadim.bichuts...@gmail.com wrote:

 Thanks. So how do I fix it?

 On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

   spark-streaming-kinesis-asl is not part of the Spark distribution on
 your cluster, so you cannot have it be just a provided dependency.  This
 is also why the KCL and its dependencies were not included in the assembly
 (but yes, they should be).


  ~ Jonathan Kelly

   From: Vadim Bichutskiy vadim.bichuts...@gmail.com
 Date: Friday, April 3, 2015 at 12:26 PM
 To: Jonathan Kelly jonat...@amazon.com
 Cc: user@spark.apache.org user@spark.apache.org
 Subject: Re: Spark + Kinesis

   Hi all,

  Good news! I was able to create a Kinesis consumer and assemble it into
 an uber jar following
 http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
 and example
 https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
 .

  However when I try to spark-submit it I get the following exception:

  *Exception in thread main java.lang.NoClassDefFoundError:
 com/amazonaws/auth/AWSCredentialsProvider*

  Do I need to include KCL dependency in *build.sbt*, here's what it
 looks like currently:

  import AssemblyKeys._

  name := "Kinesis Consumer"
  version := "1.0"
  organization := "com.myconsumer"
  scalaVersion := "2.11.5"

  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
  libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"
  libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0" % "provided"

  assemblySettings
  jarName in assembly := "consumer-assembly.jar"
  assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

  Any help appreciated.

  Thanks,
 Vadim

 On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

  It looks like you're attempting to mix Scala versions, so that's going
 to cause some problems.  If you really want to use Scala 2.11.5, you must
 also use Spark package versions built for Scala 2.11 rather than 2.10.
 Anyway, that's not quite the correct way to specify Scala dependencies in
 build.sbt.  Instead of placing the Scala version after the artifactId (like
 spark-core_2.10), what you actually want is to use just spark-core with
 two percent signs before it.  Using two percent signs will make it use the
 version of Scala that matches your declared scalaVersion.  For example:

  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

  libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"

  libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

  I think that may get you a little closer, though I think you're
 probably going to run into the same problems I ran into in this thread:
 https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I
 never really got an answer for that, and I temporarily moved on to other
 things for now.


  ~ Jonathan Kelly

   From: 'Vadim Bichutskiy' vadim.bichuts...@gmail.com
 Date: Thursday, April 2, 2015 at 9:53 AM
 To: user@spark.apache.org user@spark.apache.org
 Subject: Spark + Kinesis

   Hi all,

  I am trying to write an Amazon Kinesis consumer Scala app that
 processes data in the
 Kinesis stream. Is this the correct way to specify *build.sbt*:

  ---
  import AssemblyKeys._

  name := "Kinesis Consumer"
  version := "1.0"
  organization := "com.myconsumer"
  scalaVersion := "2.11.5"

  libraryDependencies ++= Seq(
    "org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
    "org.apache.spark" % "spark-streaming_2.10" % "1.3.0",
    "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")

  assemblySettings
  jarName in assembly := "consumer-assembly.jar"
  assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
 

  In *project/assembly.sbt* I have only the following line:

  addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

  I am using sbt 0.13.7. I adapted Example 7.7 in the Learning Spark
 book.

  Thanks,
 Vadim






Re: Spark Streaming FileStream Nested File Support

2015-04-03 Thread Tathagata Das
A sort-of-hacky workaround is to use a queueStream where you can manually
create RDDs (using sparkContext.hadoopFile) and insert them into the queue. Note
that this is for testing only, as queueStream does not work with driver
fault recovery.

TD
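
A rough sketch of that workaround (paths are made up; sc.textFile stands in for
sparkContext.hadoopFile, and an existing StreamingContext ssc is assumed):

  import scala.collection.mutable
  import org.apache.spark.rdd.RDD

  // Testing only: queueStream does not work with driver fault recovery.
  val rddQueue = new mutable.Queue[RDD[String]]()
  val logs = ssc.queueStream(rddQueue)

  logs.foreachRDD { rdd => println(s"batch record count = ${rdd.count()}") }

  ssc.start()

  // Manually build one RDD per nested day directory and push it into the queue.
  for (day <- Seq("2015/03/01", "2015/03/02")) {
    rddQueue += ssc.sparkContext.textFile(s"s3n://mybucket/$day/*log")
  }

  ssc.awaitTermination()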

On Fri, Apr 3, 2015 at 12:23 PM, adamgerst adamge...@gmail.com wrote:

 So after pulling my hair out for a bit trying to convert one of my standard
 spark jobs to streaming I found that FileInputDStream does not support
 nested folders (see the brief mention here

 http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
 the fileStream method returns a FileInputDStream).  So before, for my
 standard job, I was reading from say

 s3n://mybucket/2015/03/02/*log

 And could also modify it to simply get an entire months worth of logs.
 Since the logs are split up based upon their date, when the batch ran for
 the day, I simply passed in a parameter of the date to make sure I was
 reading the correct data

 But since I want to turn this job into a streaming job I need to simply do
 something like

 s3n://mybucket/*log

 This would totally work fine if it were a standard spark application, but
 fails for streaming.  Is there anyway I can get around this limitation?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-FileStream-Nested-File-Support-tp22370.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: spark mesos deployment : starting workers based on attributes

2015-04-03 Thread Tim Chen
Hi Ankur,

There isn't a way to do that yet, but it's simple to add.

Can you create a JIRA in Spark for this?

Thanks!

Tim

On Fri, Apr 3, 2015 at 1:08 PM, Ankur Chauhan achau...@brightcove.com
wrote:

 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 Hi,

 I am trying to figure out if there is a way to tell the mesos
 scheduler in spark to isolate the workers to a set of mesos slaves
 that have a given attribute such as `tachyon:true`.

 Anyone knows if that is possible or how I could achieve such a behavior.

 Thanks!
 - -- Ankur Chauhan
 -BEGIN PGP SIGNATURE-

 iQEcBAEBAgAGBQJVHvMlAAoJEOSJAMhvLp3LaV0H/jtX+KQDyorUESLIKIxFV9KM
 QjyPtVquwuZYcwLqCfQbo62RgE/LeTjjxzifTzMM5D6cf4ULBH1TcS3Is2EdOhSm
 UTMfJyvK06VFvYMLiGjqN4sBG3DFdamQif18qUJoKXX/Z9cUQO9SaSjIezSq2gd8
 0lM3NLEQjsXY5uRJyl9GYDxcFsXPVzt1crXAdrtVsIYAlFmhcrm1n/5+Peix89Oh
 vgK1J7e0ei7Rc4/3BR2xr8f9us+Jfqym/xe+45h1YYZxZWrteCa48NOGixuUJjJe
 zb1MxNrTFZhPrKFT7pz9kCUZXl7DW5hzoQCH07CXZZI3B7kFS+5rjuEIB9qZXPE=
 =cadl
 -END PGP SIGNATURE-

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: MLlib: save models to HDFS?

2015-04-03 Thread Justin Yip
Hello Zhou,

You can look at the recommendation template
http://templates.prediction.io/PredictionIO/template-scala-parallel-recommendation
of PredictionIO. PredictionIO is built on top of Spark, and this
template illustrates how you can save the ALS model to HDFS and then reload
it later.

Justin


On Fri, Apr 3, 2015 at 9:16 AM, S. Zhou myx...@yahoo.com.invalid wrote:

 I am new to MLib so I have a basic question: is it possible to save MLlib
 models (particularly CF models) to HDFS and then reload it later? If yes,
 could u share some sample code (I could not find it in MLlib tutorial).
 Thanks!



Spark TeraSort source request

2015-04-03 Thread Tom
Hi all,

As we all know, Spark has set the record for sorting data, as published on:
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html.

Here at our group, we would love to verify these results and compare
machines using this benchmark. We've spent quite some time trying to find the
TeraSort source code that was used, but cannot find it anywhere.

We did find two candidates: 

A version posted by Reynold [1], the poster of the message above. This
version is stuck at "// TODO: Add partition-local (external) sorting
using TeraSortRecordOrdering", and only generates data.

Here, Ewan noticed that it didn't appear to be similar to Hadoop TeraSort.
[2] After this he created a version on his own [3]. With this version, we
noticed problems with TeraValidate with datasets above ~10G (as mentioned by
others at [4]). When examining the raw input and output files, it actually
appears that the input data is sorted and the output data unsorted in both
cases. 

Because of this, we believe we did not yet find the actual used source code.
I've tried searching the Spark user forum archives and saw requests from other
people indicating a demand, but did not succeed in finding the actual
source code.

My question:
Could you guys please make the source code of the used TeraSort program,
preferably with settings, available? If not, what are the reasons that this
seems to be withheld?

Thanks for any help,

Tom Hubregtsen 

[1]
https://github.com/rxin/spark/commit/adcae69145905162fa3b6932f70be2c932f95f87
[2]
http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3c5462092c.1060...@ugent.be%3E
[3] https://github.com/ehiggs/spark-terasort
[4]
http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAPszQwgap4o1inZkTwcwV=7scwoqtr5yxfnsqo5p2kgp1bn...@mail.gmail.com%3E



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-TeraSort-source-request-tp22371.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Regarding MLLIB sparse and dense matrix

2015-04-03 Thread Joseph Bradley
If you can examine your data matrix and know that about  1/6 or so of the
values are non-zero (so  5/6 are zeros), then it's probably worth using
sparse vectors.  (1/6 is a rough estimate.)

There is support for L1 and L2 regularization.  You can look at the guide
here:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#logistic-regression
and the API docs linked from the menu.
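
To make both points concrete, a small sketch (feature values are made up; the
regularization is set through the optimizer of LogisticRegressionWithSGD):

  import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.optimization.{L1Updater, SquaredL2Updater}
  import org.apache.spark.mllib.regression.LabeledPoint

  // Dense: every slot is stored, so a missing/null score must be encoded explicitly (e.g. 0.0).
  val dense = Vectors.dense(0.9, 0.0, 0.75)

  // Sparse: only the non-zero entries are stored (size, indices, values).
  val sparse = Vectors.sparse(3, Array(0, 2), Array(0.9, 0.75))

  val training = sc.parallelize(Seq(LabeledPoint(1.0, dense), LabeledPoint(0.0, sparse)))

  // L2-regularized logistic regression; swap in new L1Updater() for L1.
  val lr = new LogisticRegressionWithSGD()
  lr.optimizer
    .setNumIterations(100)
    .setRegParam(0.1)
    .setUpdater(new SquaredL2Updater())

  val model = lr.run(training)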

On Fri, Apr 3, 2015 at 1:24 PM, Jeetendra Gangele gangele...@gmail.com
wrote:

 Hi All
 I am building a logistic regression for matching person data. Let's say
 two person objects are given with their attributes and we need to find a score;
 that means on one side you have 10 million records and on the other side we have 1
 record, and we need to tell which one matches with the highest score among the 1 million.

 I am storing the scores of the similarity algorithms in a dense matrix and considering
 this as features. We will apply many similarity algorithms to each attribute.

 Should I use sparse or dense vectors? What happens with dense vectors when a score is null or
 when some of the attributes are missing?

 Is there any support for regularized logistic regression? Currently I am
 using LogisticRegressionWithSGD.

 Regards
 jeetendra



Migrating from Spark 0.8.0 to Spark 1.3.0

2015-04-03 Thread Ritesh Kumar Singh
Hi,

Are there any tutorials that explain all the changes between Spark
0.8.0 and Spark 1.3.0, and how should we approach this migration?


Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
Thanks Mohammed,

I was aware of Calliope, but haven't used it since the
spark-cassandra-connector project got released.  I was not aware of the
CalliopeServer2; cool thanks for sharing that one.

I would appreciate it if you could lmk how you decide to proceed with this;
I can see this coming up on my radar in the next few months; thanks.

-Todd

On Fri, Apr 3, 2015 at 5:53 PM, Mohammed Guller moham...@glassbeam.com
wrote:

  Thanks, Todd.



 It is an interesting idea; worth trying.



 I think the cash project is old. The tuplejump guy has created another
 project called CalliopeServer2, which works like a charm with BI tools that
 use JDBC, but unfortunately Tableau throws an error when it connects to it.



 Mohammed



 *From:* Todd Nist [mailto:tsind...@gmail.com]
 *Sent:* Friday, April 3, 2015 11:39 AM
 *To:* pawan kumar
 *Cc:* Mohammed Guller; user@spark.apache.org

 *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra



 Hi Mohammed,



 Not sure if you have tried this or not.  You could try using the below api
 to start the thriftserver with an existing context.


 https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42

 The one thing that Michael Armbrust @ Databricks recommended was this:

 You can start a JDBC server with an existing context.  See my answer here:
 http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html

 So something like this based on example from Cheng Lian:


 * Server*

 import  org.apache.spark.sql.hive.HiveContext

 import  org.apache.spark.sql.catalyst.types._



 val  sparkContext  =  sc

 import  sparkContext._

 val  sqlContext  =  new  HiveContext(sparkContext)

 import  sqlContext._

 makeRDD((1, "hello") :: (2, "world") :: Nil).toSchemaRDD.cache().registerTempTable("t")

 // replace the above with the C* + spark-cassandra-connector to generate a
 // SchemaRDD and registerTempTable



 import  org.apache.spark.sql.hive.thriftserver._

 HiveThriftServer2.startWithContext(sqlContext)

   Then Startup

 ./bin/beeline -u jdbc:hive2://localhost:1/default

 0: jdbc:hive2://localhost:1/default select * from t;



   I have not tried this yet from Tableau.   My understanding is that the
 tempTable is only valid as long as the sqlContext is, so if one terminates
 the code representing the *Server*, and then restarts the standard thrift
 server, sbin/start-thriftserver ..., the table won't be available.



 Another possibility is to perhaps use the tuplejump cash project,
 https://github.com/tuplejump/cash.



 HTH.



 -Todd



 On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar pkv...@gmail.com wrote:

 Thanks mohammed. Will give it a try today. We would also need the
 sparksSQL piece as we are migrating our data store from oracle to C* and it
 would be easier to maintain all the reports rather recreating each one from
 scratch.

 Thanks,
 Pawan Venugopal.

 On Apr 3, 2015 7:59 AM, Mohammed Guller moham...@glassbeam.com wrote:

 Hi Todd,



 We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly
 with C* using the ODBC driver, but now would like to add Spark SQL to the
 mix. I haven’t been able to find any documentation for how to make this
 combination work.



 We are using the Spark-Cassandra-Connector in our applications, but
 haven’t been able to figure out how to get the Spark SQL Thrift Server to
 use it and connect to C*. That is the missing piece. Once we solve that
 piece of the puzzle then Tableau should be able to see the tables in C*.



 Hi Pawan,

 Tableau + C* is pretty straight forward, especially if you are using DSE.
 Create a new DSN in Tableau using the ODBC driver that comes with DSE. Once
 you connect, Tableau allows to use C* keyspace as schema and column
 families as tables.



 Mohammed



 *From:* pawan kumar [mailto:pkv...@gmail.com]
 *Sent:* Friday, April 3, 2015 7:41 AM
 *To:* Todd Nist
 *Cc:* user@spark.apache.org; Mohammed Guller
 *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra



 Hi Todd,

 Thanks for the link. I would be interested in this solution. I am using
 DSE for cassandra. Would you provide me with info on connecting with DSE
 either through Tableau or zeppelin. The goal here is query cassandra
 through spark sql so that I could perform joins and groupby on my queries.
 Are you able to perform spark sql queries with tableau?

 Thanks,
 Pawan Venugopal

 On Apr 3, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote:

 What version of Cassandra are you using?  Are you using DSE or the stock
 Apache Cassandra version?  I have connected it with DSE, but have not
 attempted it with the standard Apache Cassandra version.



 FWIW,
 http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
 provides an ODBC driver tor accessing C* from Tableau.  Granted it does not
 provide all the goodness of Spark.  Are you attempting to 

Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Denny Lee
 Sweet - I'll have to play with this then! :)
On Fri, Apr 3, 2015 at 19:43 Reynold Xin r...@databricks.com wrote:

 There is already an explode function on DataFrame btw


 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L712

 I think something like this would work. You might need to play with the
 type.

 df.explode("arrayBufferColumn") { x => x }



 On Fri, Apr 3, 2015 at 6:43 AM, Denny Lee denny.g@gmail.com wrote:

 Thanks Dean - fun hack :)

 On Fri, Apr 3, 2015 at 6:11 AM Dean Wampler deanwamp...@gmail.com
 wrote:

 A hack workaround is to use flatMap:

  rdd.flatMap { case (date, array) => for (x <- array) yield (date, x) }

 For those of you who don't know Scala, the for comprehension iterates
 through the ArrayBuffer, named array and yields new tuples with the date
  and each element. The case expression to the left of the => pattern matches
 on the input tuples.
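
  For completeness, a rough sketch of the same trick applied directly to the DataFrame's
  row RDD (the column types below are assumptions about the frame in question):

    import org.apache.spark.sql.Row

    // Turn [2015-04, ArrayBuffer(A, B, C, D)] rows into one (month, value) pair per element.
    val exploded = df.rdd.flatMap {
      case Row(month: String, values: Seq[_]) => values.map(v => (month, v.toString))
    }
    exploded.collect().foreach(println)  // (2015-04,A), (2015-04,B), ...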

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Thu, Apr 2, 2015 at 10:45 PM, Denny Lee denny.g@gmail.com
 wrote:

 Thanks Michael - that was it!  I was drawing a blank on this one for
 some reason - much appreciated!


 On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust mich...@databricks.com
 wrote:

 A lateral view explode using HiveQL.  I'm hopping to add explode
 shorthand directly to the df API in 1.4.

 On Thu, Apr 2, 2015 at 7:10 PM, Denny Lee denny.g@gmail.com
 wrote:

 Quick question - the output of a dataframe is in the format of:

 [2015-04, ArrayBuffer(A, B, C, D)]

 and I'd like to return it as:

 2015-04, A
 2015-04, B
 2015-04, C
 2015-04, D

 What's the best way to do this?

 Thanks in advance!








Re: kmeans|| in Spark is not real paralleled?

2015-04-03 Thread Xi Shen
Hi Xiangrui,

I have create JIRA https://issues.apache.org/jira/browse/SPARK-6706, and
attached the sample code. But I could not attache the test data. I will
update the bug once I found a place to host the test data.


Thanks,
David


On Tue, Mar 31, 2015 at 8:18 AM Xiangrui Meng men...@gmail.com wrote:

 This PR updated the k-means|| initialization:
 https://github.com/apache/spark/commit/ca7910d6dd7693be2a675a0d6a6fcc9eb0aaeb5d,
 which was included in 1.3.0. It should fix kmean|| initialization with
 large k. Please create a JIRA for this issue and send me the code and the
 dataset to produce this problem. Thanks! -Xiangrui

 On Sun, Mar 29, 2015 at 1:20 AM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 I have opened a couple of threads asking about k-means performance
 problem in Spark. I think I made a little progress.

 Previous I use the simplest way of KMeans.train(rdd, k, maxIterations).
 It uses the kmeans|| initialization algorithm which supposedly to be a
 faster version of kmeans++ and give better results in general.

 But I observed that if the k is very large, the initialization step takes
 a long time. From the CPU utilization chart, it looks like only one thread
 is working. Please see
 https://stackoverflow.com/questions/29326433/cpu-gap-when-doing-k-means-with-spark
 .

 I read the paper,
 http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf, and it
 points out kmeans++ initialization algorithm will suffer if k is large.
 That's why the paper contributed the kmeans|| algorithm.


 If I invoke KMeans.train using the random initialization algorithm, I
 do not observe this problem, even with a very large k, like k=5000. This
 makes me suspect that the kmeans|| in Spark is not properly implemented and
 does not utilize a parallel implementation.
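
 For reference, a minimal sketch of the two initialization modes being compared, using the
 KMeans.train overload that takes an initializationMode string (the data here is synthetic,
 just to make the sketch self-contained; k is arbitrary):

   import org.apache.spark.mllib.clustering.KMeans
   import org.apache.spark.mllib.linalg.Vectors
   import scala.util.Random

   // Synthetic data: 100k random 10-dimensional points.
   val data = sc.parallelize(1 to 100000)
     .map(_ => Vectors.dense(Array.fill(10)(Random.nextDouble())))
     .cache()

   val k = 5000
   val maxIterations = 20
   val runs = 1

   // Default initialization (k-means||), the step that appears single-threaded for large k:
   val modelParallelInit = KMeans.train(data, k, maxIterations, runs, KMeans.K_MEANS_PARALLEL)

   // Random initialization, which does not show the long single-threaded gap:
   val modelRandomInit = KMeans.train(data, k, maxIterations, runs, KMeans.RANDOM)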


 I have also tested my code and data set with Spark 1.3.0, and I still
 observe this problem. I quickly checked the PR regarding the KMeans
 algorithm change from 1.2.0 to 1.3.0. It seems to be only code improvement
 and polish, not changing/improving the algorithm.


 I originally worked on Windows 64bit environment, and I also tested on
 Linux 64bit environment. I could provide the code and data set if anyone
 want to reproduce this problem.


 I hope a Spark developer could comment on this problem and help
 identifying if it is a bug.


 Thanks,

 Xi Shen
 about.me/davidshen





Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Reynold Xin
There is already an explode function on DataFrame btw

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L712

I think something like this would work. You might need to play with the
type.

df.explode("arrayBufferColumn") { x => x }



On Fri, Apr 3, 2015 at 6:43 AM, Denny Lee denny.g@gmail.com wrote:

 Thanks Dean - fun hack :)

 On Fri, Apr 3, 2015 at 6:11 AM Dean Wampler deanwamp...@gmail.com wrote:

 A hack workaround is to use flatMap:

 rdd.flatMap { case (date, array) => for (x <- array) yield (date, x) }

 For those of you who don't know Scala, the for comprehension iterates
 through the ArrayBuffer, named array and yields new tuples with the date
 and each element. The case expression to the left of the => pattern matches
 on the input tuples.

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Thu, Apr 2, 2015 at 10:45 PM, Denny Lee denny.g@gmail.com wrote:

 Thanks Michael - that was it!  I was drawing a blank on this one for
 some reason - much appreciated!


 On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust mich...@databricks.com
 wrote:

 A lateral view explode using HiveQL.  I'm hopping to add explode
 shorthand directly to the df API in 1.4.

 On Thu, Apr 2, 2015 at 7:10 PM, Denny Lee denny.g@gmail.com
 wrote:

 Quick question - the output of a dataframe is in the format of:

 [2015-04, ArrayBuffer(A, B, C, D)]

 and I'd like to return it as:

 2015-04, A
 2015-04, B
 2015-04, C
 2015-04, D

 What's the best way to do this?

 Thanks in advance!







Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Denny Lee
Thanks Dean - fun hack :)

On Fri, Apr 3, 2015 at 6:11 AM Dean Wampler deanwamp...@gmail.com wrote:

 A hack workaround is to use flatMap:

 rdd.flatMap { case (date, array) => for (x <- array) yield (date, x) }

 For those of you who don't know Scala, the for comprehension iterates
 through the ArrayBuffer, named array and yields new tuples with the date
 and each element. The case expression to the left of the => pattern matches
 on the input tuples.

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Thu, Apr 2, 2015 at 10:45 PM, Denny Lee denny.g@gmail.com wrote:

 Thanks Michael - that was it!  I was drawing a blank on this one for some
 reason - much appreciated!


 On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust mich...@databricks.com
 wrote:

 A lateral view explode using HiveQL.  I'm hopping to add explode
 shorthand directly to the df API in 1.4.

 On Thu, Apr 2, 2015 at 7:10 PM, Denny Lee denny.g@gmail.com wrote:

 Quick question - the output of a dataframe is in the format of:

 [2015-04, ArrayBuffer(A, B, C, D)]

 and I'd like to return it as:

 2015-04, A
 2015-04, B
 2015-04, C
 2015-04, D

 What's the best way to do this?

 Thanks in advance!







Re: Reading a large file (binary) into RDD

2015-04-03 Thread Dean Wampler
This might be overkill for your needs, but the scodec parser combinator
library might be useful for creating a parser.

https://github.com/scodec/scodec

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Thu, Apr 2, 2015 at 6:53 PM, java8964 java8...@hotmail.com wrote:

 I think implementing your own InputFormat and using
 SparkContext.hadoopFile() is the best option for your case.

 Yong

 --
 From: kvi...@vt.edu
 Date: Thu, 2 Apr 2015 17:31:30 -0400
 Subject: Re: Reading a large file (binary) into RDD
 To: freeman.jer...@gmail.com
 CC: user@spark.apache.org


 The file has a specific structure. I outline it below.

 The input file is basically a representation of a graph.

 INT
 INT(A)
 LONG (B)
 A INTs(Degrees)
 A SHORTINTs  (Vertex_Attribute)
 B INTs
 B INTs
 B SHORTINTs
 B SHORTINTs

 A - number of vertices
 B - number of edges (note that the INTs/SHORTINTs associated with this are
 edge attributes)

 After reading in the file, I need to create two RDDs (one with vertices
 and the other with edges)

 On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman freeman.jer...@gmail.com
 wrote:

 Hm, that will indeed be trickier because this method assumes records are
 the same byte size. Is the file an arbitrary sequence of mixed types, or is
 there structure, e.g. short, long, short, long, etc.?

 If you could post a gist with an example of the kind of file and how it
 should look once read in that would be useful!

 -
 jeremyfreeman.net
 @thefreemanlab

 On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:

 Thanks for the reply. Unfortunately, in my case, the binary file is a mix
 of short and long integers. Is there any other way that could of use here?

 My current method happens to have a large overhead (much more than actual
 computation time). Also, I am short of memory at the driver when it has to
 read the entire file.

 On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com
 wrote:

 If it’s a flat binary file and each record is the same length (in bytes),
 you can use Spark’s binaryRecords method (defined on the SparkContext),
 which loads records from one or more large flat binary files into an RDD.
 Here’s an example in python to show how it works:

 # write data from an array
 from numpy import random
 dat = random.randn(100,5)
 f = open('test.bin', 'w')
 f.write(dat)
 f.close()


 # load the data back in

 from numpy import frombuffer

 nrecords = 5
 bytesize = 8
 recordsize = nrecords * bytesize
 data = sc.binaryRecords('test.bin', recordsize)
 parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))


 # these should be equal
 parsed.first()
 dat[0,:]


 Does that help?

 -
 jeremyfreeman.net
 @thefreemanlab
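
 For the Scala side, the fixed-record-size variant looks roughly like this (the 5-double
 record layout, the little-endian byte order, and the path are assumptions for illustration;
 it does not handle the mixed short/long layout described above):

   import java.nio.{ByteBuffer, ByteOrder}

   // Each record: 5 doubles = 40 bytes (assumption matching the numpy example above).
   val recordSize = 5 * 8
   val records = sc.binaryRecords("hdfs:///data/test.bin", recordSize)

   val parsed = records.map { bytes =>
     val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
     Array.fill(5)(buf.getDouble)
   }

   parsed.first()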

 On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:

 What are some efficient ways to read a large file into RDDs?

 For example, have several executors read a specific/unique portion of the
 file and construct RDDs. Is this possible to do in Spark?

 Currently, I am doing a line-by-line read of the file at the driver and
 constructing the RDD.








Spark unit test fails

2015-04-03 Thread Manas Kar
Hi experts,
 I am trying to write unit tests for my Spark application, but they fail with a
javax.servlet.FilterRegistration error.

I am using CDH 5.3.2 Spark, and below is my dependency list.

val spark   = "1.2.0-cdh5.3.2"
val esriGeometryAPI = "1.2"
val csvWriter   = "1.0.0"
val hadoopClient= "2.3.0"
val scalaTest   = "2.2.1"
val jodaTime= "1.6.0"
val scalajHTTP  = "1.0.1"
val avro= "1.7.7"
val scopt   = "3.2.0"
val config  = "1.2.1"
val jobserver   = "0.4.1"
val excludeJBossNetty = ExclusionRule(organization = "org.jboss.netty")
val excludeIONetty = ExclusionRule(organization = "io.netty")
val excludeEclipseJetty = ExclusionRule(organization = "org.eclipse.jetty")
val excludeMortbayJetty = ExclusionRule(organization = "org.mortbay.jetty")
val excludeAsm = ExclusionRule(organization = "org.ow2.asm")
val excludeOldAsm = ExclusionRule(organization = "asm")
val excludeCommonsLogging = ExclusionRule(organization = "commons-logging")
val excludeSLF4J = ExclusionRule(organization = "org.slf4j")
val excludeScalap = ExclusionRule(organization = "org.scala-lang", artifact = "scalap")
val excludeHadoop = ExclusionRule(organization = "org.apache.hadoop")
val excludeCurator = ExclusionRule(organization = "org.apache.curator")
val excludePowermock = ExclusionRule(organization = "org.powermock")
val excludeFastutil = ExclusionRule(organization = "it.unimi.dsi")
val excludeJruby = ExclusionRule(organization = "org.jruby")
val excludeThrift = ExclusionRule(organization = "org.apache.thrift")
val excludeServletApi = ExclusionRule(organization = "javax.servlet", artifact = "servlet-api")
val excludeJUnit = ExclusionRule(organization = "junit")

I found this link (
http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-SecurityException-when-running-tests-with-Spark-1-0-0-td6747.html#a6749
) discussing the issue and a workaround for it, but that workaround does not
get rid of the problem for me. I am using an SBT build, which can't be changed
to Maven.

What am I missing?


Stack trace
-
[info] FiltersRDDSpec:
[info] - Spark Filter *** FAILED ***
[info]   java.lang.SecurityException: class
javax.servlet.FilterRegistration's signer information does not match
signer information of other classes in the same package
[info]   at java.lang.ClassLoader.checkCerts(Unknown Source)
[info]   at java.lang.ClassLoader.preDefineClass(Unknown Source)
[info]   at java.lang.ClassLoader.defineClass(Unknown Source)
[info]   at java.security.SecureClassLoader.defineClass(Unknown Source)
[info]   at java.net.URLClassLoader.defineClass(Unknown Source)
[info]   at java.net.URLClassLoader.access$100(Unknown Source)
[info]   at java.net.URLClassLoader$1.run(Unknown Source)
[info]   at java.net.URLClassLoader$1.run(Unknown Source)
[info]   at java.security.AccessController.doPrivileged(Native Method)
[info]   at java.net.URLClassLoader.findClass(Unknown Source)

Thanks
Manas
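A commonly suggested direction for this particular signer-information error is
to make sure the conflicting servlet-api/Jetty artifacts are actually excluded
from the Spark and Hadoop dependencies, and to run the tests in a forked JVM.
A hedged SBT sketch reusing the exclusion rules listed above (the artifact
coordinates are assumptions based on the versions in the question):

libraryDependencies ++= Seq(
  // exclude the old servlet-api/Jetty classes that clash with the signed ones
  "org.apache.spark" %% "spark-core" % spark
    excludeAll(excludeEclipseJetty, excludeServletApi),
  "org.apache.hadoop" % "hadoop-client" % hadoopClient
    excludeAll(excludeMortbayJetty, excludeServletApi, excludeAsm, excludeOldAsm)
)

// forking keeps SBT's own classpath out of the test classloader,
// which is another commonly suggested workaround for this error
fork in Test := true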


Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Frank Austin Nothaft
You’ll definitely want to use a Kryo-based serializer for Avro. We have a
Kryo-based serializer that wraps the efficient Avro serializer here.
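If you only need to get past the NotSerializableException, a minimal
configuration-only sketch (not the wrapper Frank refers to; the registrator
name below is illustrative) is to switch Spark to Kryo and register the
generated class, since Kryo does not require java.io.Serializable:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// registers the Avro-generated class so Kryo handles it with its default
// field serializer; a chill-avro style serializer could be plugged in instead
class AvroKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[AvroKryoRegistrator].getName)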

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466

On Apr 3, 2015, at 5:41 AM, Akhil Das ak...@sigmoidanalytics.com wrote:

 Because it's throwing serialization exceptions, and Kryo is a serializer that
 can serialize your objects.
 
 Thanks
 Best Regards
 
 On Fri, Apr 3, 2015 at 5:37 PM, Deepak Jain deepuj...@gmail.com wrote:
 I meant that I did not have to use Kryo. Why will Kryo help fix this issue
 now?
 
 Sent from my iPhone
 
 On 03-Apr-2015, at 5:36 pm, Deepak Jain deepuj...@gmail.com wrote:
 
 I was able to write a record that extends SpecificRecord (Avro); that class
 was not auto-generated. Do we need to do something extra for auto-generated
 classes?
 
 Sent from my iPhone
 
 On 03-Apr-2015, at 5:06 pm, Akhil Das ak...@sigmoidanalytics.com wrote:
 
 This thread might give you some insights 
 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3CCA+WVT8WXbEHac=N0GWxj-s9gqOkgG0VRL5B=ovjwexqm8ev...@mail.gmail.com%3E
 
 Thanks
 Best Regards
 
 On Fri, Apr 3, 2015 at 3:53 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
 My Spark Job failed with
 
 
 15/04/03 03:15:36 INFO scheduler.DAGScheduler: Job 0 failed: 
 saveAsNewAPIHadoopFile at AbstractInputHelper.scala:103, took 2.480175 s
 15/04/03 03:15:36 ERROR yarn.ApplicationMaster: User class threw exception: 
 Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 0) had a not 
 serializable result: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
 - object not serializable (class: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value: 
 {userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt: 
 null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
 - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
 - object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0, 
 spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt: 
 null, currPsLvlId: null}))
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 
 in stage 2.0 (TID 0) had a not serializable result: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
 - object not serializable (class: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value: 
 {userId: 0, spsPrgrmId: 0, spsSlrLevelCd: 0, spsSlrLevelSumStartDt: 
 null, spsSlrLevelSumEndDt: null, currPsLvlId: null})
 - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
 - object (class scala.Tuple2, (0,{userId: 0, spsPrgrmId: 0, 
 spsSlrLevelCd: 0, spsSlrLevelSumStartDt: null, spsSlrLevelSumEndDt: 
 null, currPsLvlId: null}))
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at scala.Option.foreach(Option.scala:236)
 at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 15/04/03 03:15:36 INFO yarn.ApplicationMaster: Final app status: FAILED, 
 exitCode: 15, (reason: User class threw exception: Job aborted due to stage 
 failure: Task 0.0 in stage 2.0 (TID 0) had a not serializable result: 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
 Serialization stack:
 
 
 
 com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum is
 auto-generated from an Avro schema using the Avro generate-sources Maven plugin.
 
 
 package com.ebay.ep.poc.spark.reporting.process.model.dw;  
 
 @SuppressWarnings("all")
 
 @org.apache.avro.specific.AvroGenerated
 
 public class SpsLevelMetricSum extends 
 org.apache.avro.specific.SpecificRecordBase implements 
 org.apache.avro.specific.SpecificRecord {
 
 ...
 ...
 }
 
 Can anyone suggest how to fix this?
 
 
 
 -- 
 Deepak
 
 
 



Re: Reading a large file (binary) into RDD

2015-04-03 Thread Vijayasarathy Kannan
Thanks everyone for the inputs.

I guess I will try out a custom implementation of InputFormat. But I have
no idea where to start. Are there any code examples of this that might help?
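A minimal skeleton to start from might look roughly like the following (a
sketch using the newer org.apache.hadoop.mapreduce API; the class names and the
fixed 8-byte record size are placeholders, and for a header-structured file
like yours the header would have to be parsed in initialize() to work out the
real record boundaries):

import org.apache.hadoop.fs.{FSDataInputStream, Path}
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

// emits (byte offset, raw record bytes) pairs for fixed-length binary records
class FixedRecordInputFormat extends FileInputFormat[LongWritable, BytesWritable] {
  // keeping files unsplittable sidesteps records straddling split boundaries
  override def isSplitable(context: JobContext, file: Path): Boolean = false

  override def createRecordReader(split: InputSplit, context: TaskAttemptContext)
      : RecordReader[LongWritable, BytesWritable] =
    new FixedRecordReader(recordLength = 8) // record size is a placeholder
}

class FixedRecordReader(recordLength: Int)
    extends RecordReader[LongWritable, BytesWritable] {
  private var in: FSDataInputStream = _
  private var start = 0L
  private var pos = 0L
  private var end = 0L
  private val key = new LongWritable()
  private val value = new BytesWritable()

  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = {
    val fileSplit = split.asInstanceOf[FileSplit]
    val path = fileSplit.getPath
    val fs = path.getFileSystem(context.getConfiguration)
    in = fs.open(path)
    start = fileSplit.getStart
    end = start + fileSplit.getLength
    pos = start
    in.seek(start)
  }

  override def nextKeyValue(): Boolean = {
    if (pos + recordLength > end) return false
    val buf = new Array[Byte](recordLength)
    in.readFully(buf)                 // read exactly one record
    key.set(pos)
    value.set(buf, 0, recordLength)
    pos += recordLength
    true
  }

  override def getCurrentKey: LongWritable = key
  override def getCurrentValue: BytesWritable = value
  override def getProgress: Float =
    if (end == start) 1.0f else (pos - start).toFloat / (end - start)
  override def close(): Unit = if (in != null) in.close()
}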

On Fri, Apr 3, 2015 at 9:15 AM, Dean Wampler deanwamp...@gmail.com wrote:

 This might be overkill for your needs, but the scodec parser combinator
 library might be useful for creating a parser.

 https://github.com/scodec/scodec

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Thu, Apr 2, 2015 at 6:53 PM, java8964 java8...@hotmail.com wrote:

 I think implementing your own InputFormat and using
 SparkContext.hadoopFile() is the best option for your case.

 Yong

 --
 From: kvi...@vt.edu
 Date: Thu, 2 Apr 2015 17:31:30 -0400
 Subject: Re: Reading a large file (binary) into RDD
 To: freeman.jer...@gmail.com
 CC: user@spark.apache.org


 The file has a specific structure. I outline it below.

 The input file is basically a representation of a graph.

 INT
 INT(A)
 LONG (B)
 A INTs (Degrees)
 A SHORTINTs  (Vertex_Attribute)
 B INTs
 B INTs
 B SHORTINTs
 B SHORTINTs

 A - number of vertices
 B - number of edges (note that the INTs/SHORTINTs associated with this
 are edge attributes)

 After reading in the file, I need to create two RDDs (one with vertices
 and the other with edges)
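Given that layout, one way to avoid reading the whole file at the driver is to
let an executor parse each file via sc.binaryFiles. A sketch under assumptions
(big-endian layout, the two blocks of B INTs being edge sources and
destinations, and each individual file fitting in a single executor's memory;
the names below are illustrative):

import java.nio.ByteBuffer

val parsedGraphs = sc.binaryFiles("hdfs:///path/to/graph.bin").map { case (_, stream) =>
  val buf = ByteBuffer.wrap(stream.toArray())   // ByteBuffer defaults to big-endian
  val header      = buf.getInt()                // leading INT
  val numVertices = buf.getInt()                // A
  val numEdges    = buf.getLong().toInt         // B
  val degrees = Array.fill(numVertices)(buf.getInt())
  val vAttrs  = Array.fill(numVertices)(buf.getShort())
  val src     = Array.fill(numEdges)(buf.getInt())
  val dst     = Array.fill(numEdges)(buf.getInt())
  val eAttr1  = Array.fill(numEdges)(buf.getShort())
  val eAttr2  = Array.fill(numEdges)(buf.getShort())
  val vertices = (0 until numVertices).map(i => (i.toLong, (degrees(i), vAttrs(i))))
  val edges    = (0 until numEdges).map(i => (src(i), dst(i), eAttr1(i), eAttr2(i)))
  (vertices, edges)
}

// the two RDDs asked for
val vertexRDD = parsedGraphs.flatMap(_._1)
val edgeRDD   = parsedGraphs.flatMap(_._2)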

 On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman freeman.jer...@gmail.com
 wrote:

 Hm, that will indeed be trickier because this method assumes records are
 the same byte size. Is the file an arbitrary sequence of mixed types, or is
 there structure, e.g. short, long, short, long, etc.?

 If you could post a gist with an example of the kind of file and how it
 should look once read in that would be useful!

 -
 jeremyfreeman.net
 @thefreemanlab

 On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:

 Thanks for the reply. Unfortunately, in my case, the binary file is a mix
 of short and long integers. Is there any other way that could be of use here?

 My current method happens to have a large overhead (much more than actual
 computation time). Also, I am short of memory at the driver when it has to
 read the entire file.

 On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com
 wrote:

 If it’s a flat binary file and each record is the same length (in bytes),
 you can use Spark’s binaryRecords method (defined on the SparkContext),
 which loads records from one or more large flat binary files into an RDD.
 Here’s an example in python to show how it works:

 # write data from an array
 from numpy import random
 dat = random.randn(100,5)
 f = open('test.bin', 'w')
 f.write(dat)
 f.close()


 # load the data back in

 from numpy import frombuffer

 nrecords = 5
 bytesize = 8
 recordsize = nrecords * bytesize
 data = sc.binaryRecords('test.bin', recordsize)
 parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))


 # these should be equal
 parsed.first()
 dat[0,:]


 Does that help?

 -
 jeremyfreeman.net
 @thefreemanlab

 On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:

 What are some efficient ways to read a large file into RDDs?

 For example, have several executors read a specific/unique portion of the
 file and construct RDDs. Is this possible to do in Spark?

 Currently, I am doing a line-by-line read of the file at the driver and
 constructing the RDD.









Spark Memory Utilities

2015-04-03 Thread Stephen Carman
I noticed Spark has some nice memory-tracking estimators in it, but they are
private. We have some custom implementations of RDD and PairRDD to suit our
internal needs, and it would be fantastic if we were able to just leverage the
memory estimates that already exist in Spark.

Is there any chance they can be made public inside the library, or exposed
through some interface so that subclasses can make use of them?
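Until something like that lands, one workaround people use is to compile a tiny
accessor into the same package, since the estimator is package-private. A
sketch only, assuming the utility in question is
org.apache.spark.util.SizeEstimator with an estimate(AnyRef): Long method
(check your Spark version):

package org.apache.spark.util

// re-exposes the private[spark] estimator to code outside the Spark packages
object SizeEstimatorAccessor {
  def estimate(obj: AnyRef): Long = SizeEstimator.estimate(obj)
}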

Thanks,

Stephen Carman, M.S.
AI Engineer, Coldlight Solutions, LLC
Cell - 267 240 0363




RE: Reading a large file (binary) into RDD

2015-04-03 Thread java8964
Hadoop's TextInputFormat is a good starting point. It is not really that hard:
you just need to implement the logic to identify the record delimiter, and
think of a logical way to represent the key and value for your RecordReader.
Yong
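Wiring a custom format like that into Spark might then look roughly like this
(a sketch; FixedRecordInputFormat is the hypothetical skeleton sketched earlier
in this thread, and the path is a placeholder). One gotcha worth noting: Hadoop
reuses the Writable instances, so copy the bytes out before caching or
shuffling.

import org.apache.hadoop.io.{BytesWritable, LongWritable}

val raw = sc.newAPIHadoopFile(
  "hdfs:///data/graph.bin",
  classOf[FixedRecordInputFormat],
  classOf[LongWritable],
  classOf[BytesWritable])

// copy out of the reused Writables into plain Scala values
val records = raw.map { case (offset, bytes) =>
  (offset.get, java.util.Arrays.copyOfRange(bytes.getBytes, 0, bytes.getLength))
}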

From: kvi...@vt.edu
Date: Fri, 3 Apr 2015 11:41:13 -0400
Subject: Re: Reading a large file (binary) into RDD
To: deanwamp...@gmail.com
CC: java8...@hotmail.com; user@spark.apache.org

Thanks everyone for the inputs.
I guess I will try out a custom implementation of InputFormat. But I have no 
idea where to start. Are there any code examples of this that might help?
On Fri, Apr 3, 2015 at 9:15 AM, Dean Wampler deanwamp...@gmail.com wrote:
This might be overkill for your needs, but the scodec parser combinator
library might be useful for creating a parser.
https://github.com/scodec/scodec

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition (O'Reilly)
Typesafe
@deanwampler
http://polyglotprogramming.com

On Thu, Apr 2, 2015 at 6:53 PM, java8964 java8...@hotmail.com wrote:



I think implementing your own InputFormat and using SparkContext.hadoopFile() 
is the best option for your case.
Yong

From: kvi...@vt.edu
Date: Thu, 2 Apr 2015 17:31:30 -0400
Subject: Re: Reading a large file (binary) into RDD
To: freeman.jer...@gmail.com
CC: user@spark.apache.org

The file has a specific structure. I outline it below.
The input file is basically a representation of a graph.

INT
INT (A)
LONG (B)
A INTs (Degrees)
A SHORTINTs (Vertex_Attribute)
B INTs
B INTs
B SHORTINTs
B SHORTINTs

A - number of vertices
B - number of edges (note that the INTs/SHORTINTs associated with this are edge attributes)
After reading in the file, I need to create two RDDs (one with vertices and the 
other with edges)
On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:
Hm, that will indeed be trickier because this method assumes records are the 
same byte size. Is the file an arbitrary sequence of mixed types, or is there 
structure, e.g. short, long, short, long, etc.? 
If you could post a gist with an example of the kind of file and how it should 
look once read in that would be useful!


-
jeremyfreeman.net
@thefreemanlab



On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:
Thanks for the reply. Unfortunately, in my case, the binary file is a mix of 
short and long integers. Is there any other way that could be of use here?
My current method happens to have a large overhead (much more than actual 
computation time). Also, I am short of memory at the driver when it has to read 
the entire file.
On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:
If it’s a flat binary file and each record is the same length (in bytes), you 
can use Spark’s binaryRecords method (defined on the SparkContext), which loads 
records from one or more large flat binary files into an RDD. Here’s an example 
in python to show how it works:
# write data from an array
from numpy import random
dat = random.randn(100,5)
f = open('test.bin', 'w')
f.write(dat)
f.close()

# load the data back in
from numpy import frombuffer
nrecords = 5
bytesize = 8
recordsize = nrecords * bytesize
data = sc.binaryRecords('test.bin', recordsize)
parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))

# these should be equal
parsed.first()
dat[0,:]
Does that help?
-
jeremyfreeman.net
@thefreemanlab


On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:
What are some efficient ways to read a large file into RDDs?
For example, have several executors read a specific/unique portion of the file 
and construct RDDs. Is this possible to do in Spark?
Currently, I am doing a line-by-line read of the file at the driver and 
constructing the RDD.