Lost TID: Loss was due to fetch failure from BlockManagerId

2014-07-01 Thread Mohammed Guller
I am running Spark 1.0 on a 4-node standalone Spark cluster (1 master + 3 workers). Our app is fetching data from Cassandra and doing a basic filter, map, and countByKey on that data. I have run into a strange problem. Even if the number of rows in Cassandra is just 1M, the Spark job seems

RE: How to use groupByKey and CqlPagingInputFormat

2014-07-04 Thread Mohammed Guller
with the Datastax spark driver? Mohammed -Original Message- From: Martin Gammelsæter [mailto:martingammelsae...@gmail.com] Sent: Friday, July 4, 2014 12:43 AM To: user@spark.apache.org Subject: Re: How to use groupByKey and CqlPagingInputFormat On Thu, Jul 3, 2014 at 10:29 PM, Mohammed Guller

Spark SQL parser bug?

2014-10-08 Thread Mohammed Guller
Hi - When I run the following Spark SQL query in Spark-shell (version 1.1.0): val rdd = sqlContext.sql("SELECT a FROM x WHERE ts >= '2012-01-01T00:00:00' AND ts <= '2012-03-31T23:59:59'") it gives the following error: rdd: org.apache.spark.sql.SchemaRDD = SchemaRDD[294] at RDD at

RE: Spark SQL parser bug?

2014-10-10 Thread Mohammed Guller
, 2014 4:37 AM To: Mohammed Guller; user@spark.apache.org Subject: Re: Spark SQL parser bug? Hi Mohammed, Would you mind to share the DDL of the table x and the complete stacktrace of the exception you got? A full Spark shell session history would be more than helpful. PR #2084 had been merged

RE: Spark SQL parser bug?

2014-10-11 Thread Mohammed Guller
[a#0,ts#1], MapPartitionsRDD[37] at mapPartitions at basicOperators.scala:208 scala sRdd.collect res10: Array[org.apache.spark.sql.Row] = Array() Mohammed From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Friday, October 10, 2014 10:14 PM To: Mohammed Guller; user@spark.apache.org Subject

RE: Spark SQL parser bug?

2014-10-13 Thread Mohammed Guller
From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Sunday, October 12, 2014 1:35 AM To: Mohammed Guller; Cheng Lian; user@spark.apache.org Subject: RE: Spark SQL parser bug? Hi, I couldn’t reproduce the bug with the latest master branch. Which version are you using? Can you also list data

RE: Spark SQL parser bug?

2014-10-13 Thread Mohammed Guller
Plan == Project [a#2] ExistingRdd [a#2,ts#3], MapPartitionsRDD[22] at mapPartitions at basicOperators.scala:208 scala s.collect res5: Array[org.apache.spark.sql.Row] = Array() Mohammed From: Yin Huai [mailto:huaiyin@gmail.com] Sent: Monday, October 13, 2014 7:19 AM To: Mohammed Guller Cc

RE: Spark SQL parser bug?

2014-10-13 Thread Mohammed Guller
That explains it. Thanks! Mohammed From: Yin Huai [mailto:huaiyin@gmail.com] Sent: Monday, October 13, 2014 8:47 AM To: Mohammed Guller Cc: Cheng, Hao; Cheng Lian; user@spark.apache.org Subject: Re: Spark SQL parser bug? Yeah, it is not related to timezone. I think you hit this issue https

Play framework

2014-10-15 Thread Mohammed Guller
Hi - Has anybody figured out how to integrate a Play application with Spark and run it on a Spark cluster using spark-submit script? I have seen some blogs about creating a simple Play app and running it locally on a dev machine with sbt run command. However, those steps don't work for

RE: Play framework

2014-10-16 Thread Mohammed Guller
that piece of code? Also is there any specific reason why you are not using play dist instead? Mohammed From: US Office Admin [mailto:ad...@vectorum.com] Sent: Thursday, October 16, 2014 11:41 AM To: Surendranauth Hiraman; Mohammed Guller Cc: Daniel Siegmann; user@spark.apache.org Subject: Re

RE: Play framework

2014-10-16 Thread Mohammed Guller
To: Mohammed Guller Cc: US Office Admin; Surendranauth Hiraman; Daniel Siegmann; user@spark.apache.org Subject: Re: Play framework Hi, Below is the link for a simple Play + SparkSQL example - http://blog.knoldus.com/2014/07/14/play-with-spark-building-apache-spark-with-play-framework-part-3

RE: Play framework

2014-10-16 Thread Mohammed Guller
What about all the play dependencies since the jar created by the ‘Play package’ won’t include the play jar or any of the 100+ jars on which play itself depends? Mohammed From: US Office Admin [mailto:ad...@vectorum.com] Sent: Thursday, October 16, 2014 7:05 PM To: Mohammed Guller

RE: Play framework

2014-10-16 Thread Mohammed Guller
To: Mohammed Guller Cc: US Office Admin; Surendranauth Hiraman; Daniel Siegmann; user@spark.apache.org Subject: Re: Play framework In our case, Play libraries are not required to run spark jobs. Hence they are available only on master and play runs as a regular scala application. I can't think

RE: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Mohammed Guller
Try a version built with Akka 2.2.x Mohammed From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Tuesday, October 28, 2014 3:03 AM To: user Subject: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext Hi, I got the following exceptions

how to retrieve the value of a column of type date/timestamp from a Spark SQL Row

2014-10-28 Thread Mohammed Guller
Hi - The Spark SQL Row class has methods such as getInt, getLong, getBoolean, getFloat, getDouble, etc. However, I don't see a getDate method. So how can one retrieve a date/timestamp type column from a result set? Thanks, Mohammed

RE: how to retrieve the value of a column of type date/timestamp from a Spark SQL Row

2014-10-29 Thread Mohammed Guller
:23 PM To: Zhan Zhang Cc: Mohammed Guller; user@spark.apache.org Subject: Re: how to retrieve the value of a column of type date/timestamp from a Spark SQL Row Or def getAs[T](i: Int): T Best Regards, Shixiong Zhu 2014-10-29 13:16 GMT+08:00 Zhan Zhang zzh...@hortonworks.commailto:zzh
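
A minimal sketch of the getAs approach suggested in this thread for timestamp columns; the query, table, and column names are hypothetical, not from the thread:

    import java.sql.Timestamp

    val rows = sqlContext.sql("SELECT name, created_at FROM events")  // created_at is a timestamp column
    rows.collect().foreach { row =>
      val name = row.getString(0)
      val ts = row.getAs[Timestamp](1)  // generic accessor, used in lieu of a getDate/getTimestamp method
      println(name + " -> " + ts)
    }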

RE: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-29 Thread Mohammed Guller
I am not sure about that. Can you try a Spray version built with 2.2.x along with Spark 1.1 and include the Akka dependencies in your project’s sbt file? Mohammed From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Tuesday, October 28, 2014 8:58 PM To: Mohammed Guller Cc: user Subject

RE: Spark and Play

2014-11-11 Thread Mohammed Guller
Actually, it is possible to integrate Spark 1.1.0 with Play 2.2.x. Here is a sample build.sbt file: name := "xyz" version := "0.1" scalaVersion := "2.10.4" libraryDependencies ++= Seq( jdbc, anorm, cache, "org.apache.spark" %% "spark-core" % "1.1.0", "com.typesafe.akka" %% "akka-actor" % "2.2.3",

RE: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Mohammed Guller
David, Here is what I would suggest: 1 - Does a new SparkContext get created in the web tier for each new request for processing? Create a single SparkContext that gets shared across multiple web requests. Depending on the framework that you are using for the web-tier, it should not be
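
A minimal sketch of the shared-context pattern described above (the app name and master URL are hypothetical): build the SparkContext once at application startup and reuse it from every request handler, rather than creating one per request.

    import org.apache.spark.{SparkConf, SparkContext}

    object SparkHolder {
      // Created lazily on first use and then shared by all web requests.
      lazy val sc: SparkContext = {
        val conf = new SparkConf()
          .setAppName("web-backend")
          .setMaster("spark://master-host:7077")
        new SparkContext(conf)
      }
    }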

RE: Spark and Play

2014-11-13 Thread Mohammed Guller
Meehan [mailto:jnmee...@gmail.com] Sent: Tuesday, November 11, 2014 11:35 PM To: Mohammed Guller Cc: Patrick Wendell; Akshat Aranya; user@spark.apache.org Subject: Re: Spark and Play You can also build a Play 2.2.x + Spark 1.1.0 fat jar with sbt-assembly for, e.g. yarn-client support or using

querying data from Cassandra through the Spark SQL Thrift JDBC server

2014-11-19 Thread Mohammed Guller
Hi - I was curious if anyone is using the Spark SQL Thrift JDBC server with Cassandra. It would be great be if you could share how you got it working? For example, what config changes have to be done in hive-site.xml, what additional jars are required, etc.? I have a Spark app that can

RE: tableau spark sql cassandra

2014-11-20 Thread Mohammed Guller
Hi Jerome, This is cool. It would be great if you could share more details about how you finally got your setup to work. For example, what additional libraries/jars you are using. How are you configuring the ThriftServer to use the additional jars to communicate with Cassandra? In addition, how

RE: tableau spark sql cassandra

2014-11-21 Thread Mohammed Guller
Thanks, Jerome. BTW, have you tried the CalliopeServer2 from tuplejump? I was able to quickly connect from beeline/Squirrel to my Cassandra cluster using CalliopeServer2, which extends the Spark SQL Thrift Server. It was very straightforward. The next step is to connect from Tableau, but I can't find

RE: Spark SQL parser bug?

2014-11-25 Thread Mohammed Guller
Leon, I solved the problem by creating a work around for it, so didn't have a need to upgrade to 1.1.2-SNAPSHOT. Mohammed -Original Message- From: Leon [mailto:pachku...@gmail.com] Sent: Tuesday, November 25, 2014 11:36 AM To: u...@spark.incubator.apache.org Subject: RE: Spark SQL

RE: Creating a front-end for output from Spark/PySpark

2014-11-25 Thread Mohammed Guller
Two options that I can think of: 1) Use the Spark SQL Thrift/JDBC server. 2) Develop a web app using some framework such as Play and expose a set of REST APIs for sending queries. Inside your web app backend, you initialize the Spark SQL context only once when your app initializes.
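
For option 1, a front-end can talk to the Spark SQL Thrift server over plain JDBC using the Hive driver. A minimal sketch; the host, port, and table name are assumptions (10000 is the Thrift server's default port):

    import java.sql.DriverManager

    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "", "")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("SELECT count(*) FROM my_table")
    while (rs.next()) println(rs.getLong(1))
    conn.close()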

RE: querying data from Cassandra through the Spark SQL Thrift JDBC server

2014-11-25 Thread Mohammed Guller
AM To: Mohammed Guller; u...@spark.incubator.apache.org Subject: Re: querying data from Cassandra through the Spark SQL Thrift JDBC server This thread might be helpful http://apache-spark-user-list.1001560.n3.nabble.com/tableau-spark-sql-cassandra-tp19282.html On 11/20/14 4:11 AM, Mohammed Guller

RE: Calling spark from a java web application.

2014-12-02 Thread Mohammed Guller
Jamal, I have not tried this, but can you not integrate Spark SQL with your Spring Java web app just like a standalone app? I have integrated a Scala web app (using Play) with Spark SQL and it works. Mohammed From: adrian [mailto:adria...@gmail.com] Sent: Friday, November 28, 2014 11:03 AM To:

Fair scheduling across applications in stand-alone mode

2014-12-05 Thread Mohammed Guller
Hi - I understand that one can use spark.deploy.defaultCores and spark.cores.max to assign a fixed number of worker cores to different apps. However, instead of statically assigning the cores, I would like Spark to dynamically assign the cores to multiple apps. For example, when there is a

RE: Fair scheduling across applications in stand-alone mode

2014-12-08 Thread Mohammed Guller
Hi - Does anybody have any ideas on how to dynamically allocate cores instead of statically partitioning them among multiple applications? Thanks. Mohammed From: Mohammed Guller Sent: Friday, December 5, 2014 11:26 PM To: user@spark.apache.org Subject: Fair scheduling across applications

RE: equivalent to sql in

2014-12-09 Thread Mohammed Guller
Option 1: dataRDD.filter(x => (x._2 == "apple") || (x._2 == "orange")) Option 2: val fruits = Set("apple", "orange", "pear") dataRDD.filter(x => fruits.contains(x._2)) Mohammed -Original Message- From: dizzy5112 [mailto:dave.zee...@gmail.com] Sent: Tuesday, December 9, 2014 2:16 PM To:

RE: Sort based shuffle not working properly?

2015-02-03 Thread Mohammed Guller
Nitin, Suing Spark is not going to help. Perhaps you should sue someone else :-) Just kidding! Mohammed -Original Message- From: nitinkak001 [mailto:nitinkak...@gmail.com] Sent: Tuesday, February 3, 2015 1:57 PM To: user@spark.apache.org Subject: Re: Sort based shuffle not working

RE: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-20 Thread Mohammed Guller
I don’t think it will work without HDFS. Mohammed From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com] Sent: Tuesday, January 20, 2015 7:55 AM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: RE: Can I save RDD to local file system and then read it back on spark

RE: using a database connection pool to write data into an RDBMS from a Spark application

2015-02-19 Thread Mohammed Guller
-applications.html Hope this help! Kelvin On Thu, Feb 19, 2015 at 7:24 PM, Mohammed Guller moham...@glassbeam.commailto:moham...@glassbeam.com wrote: Hi – I am trying to use BoneCP (a database connection pooling library) to write data from my Spark application to an RDBMS. The database inserts

RE: How do you get the partitioner for an RDD in Java?

2015-02-17 Thread Mohammed Guller
Where did you look? BTW, it is defined in the RDD class as a val: val partitioner: Option[Partitioner] Mohammed -Original Message- From: Darin McBeath [mailto:ddmcbe...@yahoo.com.INVALID] Sent: Tuesday, February 17, 2015 1:45 PM To: User Subject: How do you get the partitioner for
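
In Scala the val can be read directly; from Java, one option is to go through the underlying Scala RDD returned by JavaRDD.rdd(). A minimal sketch in Scala:

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
    println(pairs.partitioner)                           // None - no partitioner yet
    val byKey = pairs.partitionBy(new HashPartitioner(4))
    println(byKey.partitioner)                           // Some(org.apache.spark.HashPartitioner@...)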

RE: Running a script on scala-shell on Spark Standalone Cluster

2015-01-27 Thread Mohammed Guller
Looks like the culprit is this error: FileNotFoundException: File file:/home/sparkuser/spark-1.2.0/spark-1.2.0-bin-hadoop2.4/data/cut/ratings.txt does not exist Mohammed -Original Message- From: riginos [mailto:samarasrigi...@gmail.com] Sent: Tuesday, January 27, 2015 4:24 PM To:

using a database connection pool to write data into an RDBMS from a Spark application

2015-02-19 Thread Mohammed Guller
Hi – I am trying to use BoneCP (a database connection pooling library) to write data from my Spark application to an RDBMS. The database inserts are inside a foreachPartition code block. I am getting this exception when the code tries to insert data using BoneCP: java.sql.SQLException: No
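
Setting the driver issue aside, the usual write pattern is to open the connection (or borrow one from a pool) inside foreachPartition so that nothing non-serializable is shipped from the driver. A minimal sketch with plain JDBC rather than BoneCP; it assumes an RDD of (Long, String) pairs, and the URL, credentials, and table are hypothetical:

    import java.sql.DriverManager

    rdd.foreachPartition { rows =>
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://db-host:5432/mydb", "user", "password")
      val stmt = conn.prepareStatement("INSERT INTO events (id, value) VALUES (?, ?)")
      rows.foreach { case (id: Long, value: String) =>
        stmt.setLong(1, id)
        stmt.setString(2, value)
        stmt.executeUpdate()
      }
      stmt.close()
      conn.close()
    }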

RE: using a database connection pool to write data into an RDBMS from a Spark application

2015-02-20 Thread Mohammed Guller
in it, and it's really on your classpath. On Fri, Feb 20, 2015 at 5:27 AM, Mohammed Guller moham...@glassbeam.com wrote: Hi Kelvin, Yes. I am creating an uber jar with the Postgres driver included, but nevertheless tried both –jars and –driver-classpath flags. It didn’t help. Interestingly, I

RE: Setting the number of executors in standalone mode

2015-02-20 Thread Mohammed Guller
SPARK_WORKER_MEMORY=8g will allocate 8GB of memory to Spark on each worker node. It has nothing to do with the # of executors. Mohammed From: Yiannis Gkoufas [mailto:johngou...@gmail.com] Sent: Friday, February 20, 2015 4:55 AM To: user@spark.apache.org Subject: Setting the number of executors in standalone

RE: Setting the number of executors in standalone mode

2015-02-20 Thread Mohammed Guller
AFAIK, in stand-alone mode, each Spark application gets one executor on each worker. You could run multiple workers on a machine, though. Mohammed From: Yiannis Gkoufas [mailto:johngou...@gmail.com] Sent: Friday, February 20, 2015 9:48 AM To: Mohammed Guller Cc: user@spark.apache.org Subject

RE: using a database connection pool to write data into an RDBMS from a Spark application

2015-02-20 Thread Mohammed Guller
To: Mohammed Guller Cc: Kelvin Chu; user@spark.apache.org Subject: Re: using a database connection pool to write data into an RDBMS from a Spark application Hm, others can correct me if I'm wrong, but is this what SPARK_CLASSPATH is for? On Fri, Feb 20, 2015 at 6:04 PM, Mohammed Guller moham

RE: using a database connection pool to write data into an RDBMS from a Spark application

2015-02-20 Thread Mohammed Guller
...@cloudera.com] Sent: Friday, February 20, 2015 9:42 AM To: Mohammed Guller Cc: Kelvin Chu; user@spark.apache.org Subject: Re: using a database connection pool to write data into an RDBMS from a Spark application Have a look at spark.yarn.user.classpath.first and spark.files.userClassPathFirst

RE: unknown issue in submitting a spark job

2015-01-29 Thread Mohammed Guller
Looks like the application is using a lot more memory than available. Could be a bug somewhere in the code or just underpowered machine. Hard to say without looking at the code. Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded Mohammed -Original Message- From:

RE: spark challenge: zip with next???

2015-01-29 Thread Mohammed Guller
Another solution would be to use the reduce action. Mohammed From: Ganelin, Ilya [mailto:ilya.gane...@capitalone.com] Sent: Thursday, January 29, 2015 1:32 PM To: 'derrickburns'; 'user@spark.apache.org' Subject: RE: spark challenge: zip with next??? Make a copy of your RDD with an extra entry
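
One hedged way to pair each element with the next is zipWithIndex plus a self-join on shifted indices; a minimal sketch, assuming the RDD's sort order is the ordering you want:

    val data = sc.parallelize(Seq(10, 20, 30, 40)).sortBy(x => x)
    val indexed = data.zipWithIndex().map { case (v, i) => (i, v) }
    val nexts   = indexed.map { case (i, v) => (i - 1, v) }     // element i+1 keyed by index i
    val pairsWithNext = indexed.join(nexts).values              // (10,20), (20,30), (30,40)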

RE: unknown issue in submitting a spark job

2015-01-29 Thread Mohammed Guller
How much memory are you assigning to the Spark executor on the worker node? Mohammed From: ey-chih chow [mailto:eyc...@hotmail.com] Sent: Thursday, January 29, 2015 3:35 PM To: Mohammed Guller; user@spark.apache.org Subject: RE: unknown issue in submitting a spark job The worker node has 15G

RE: ArrayBuffer within a DataFrame

2015-04-02 Thread Mohammed Guller
Hint: DF.rdd.map{} Mohammed From: Denny Lee [mailto:denny.g@gmail.com] Sent: Thursday, April 2, 2015 7:10 PM To: user@spark.apache.org Subject: ArrayBuffer within a DataFrame Quick question - the output of a dataframe is in the format of: [2015-04, ArrayBuffer(A, B, C, D)] and I'd

RE: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Mohammed Guller
: user@spark.apache.org; Mohammed Guller Subject: Re: Tableau + Spark SQL Thrift Server + Cassandra Hi Todd, Thanks for the link. I would be interested in this solution. I am using DSE for cassandra. Would you provide me with info on connecting with DSE either through Tableau or zeppelin

RE: Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-28 Thread Mohammed Guller
One reason Spark on disk is faster than MapReduce is Spark’s advanced Directed Acyclic Graph (DAG) engine. MapReduce will require a complex job to be split into multiple Map-Reduce jobs, with disk I/O at the end of each job and beginning of a new job. With Spark, you may be able to express the

RE: Exiting driver main() method...

2015-05-02 Thread Mohammed Guller
No, you don’t need to do anything special. Perhaps, your application is getting stuck somewhere? If you can share your code, someone may be able to help. Mohammed From: James Carman [mailto:ja...@carmanconsulting.com] Sent: Friday, May 1, 2015 5:53 AM To: user@spark.apache.org Subject: Exiting

RE: Spark JVM default memory

2015-05-04 Thread Mohammed Guller
Did you confirm through the Spark UI how much memory is getting allocated to your application on each worker? Mohammed From: Vijayasarathy Kannan [mailto:kvi...@vt.edu] Sent: Monday, May 4, 2015 3:36 PM To: Andrew Ash Cc: user@spark.apache.org Subject: Re: Spark JVM default memory I am trying

RE: Lambda architecture using Apache Spark

2015-05-08 Thread Mohammed Guller
Why are you not using Cassandra for storing the pre-computed views? Mohammed -Original Message- From: rafac [mailto:rafaelme...@hotmail.com] Sent: Friday, May 8, 2015 1:48 PM To: user@spark.apache.org Subject: Lambda architecture using Apache Spark I am implementing the lambda

RE: Spark streaming updating a large window more frequently

2015-05-08 Thread Mohammed Guller
If I understand you correctly, you need Window duration of 1 hour and sliding interval of 5 seconds. Mohammed -Original Message- From: Ankur Chauhan [mailto:achau...@brightcove.com] Sent: Friday, May 8, 2015 2:27 PM To: u...@spark.incubator.apache.org Subject: Spark streaming
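
A minimal sketch under that reading, assuming a StreamingContext with a 5-second batch interval and an existing DStream named events:

    import org.apache.spark.streaming.{Minutes, Seconds}

    // Recompute over the last hour of data every 5 seconds.
    val lastHour = events.window(Minutes(60), Seconds(5))
    lastHour.count().print()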

RE: HiveThriftServer2

2015-04-11 Thread Mohammed Guller
Thanks, Cheng. BTW, there is another thread on the same topic. It looks like the thrift-server will be published for 1.3.1. Mohammed From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Saturday, April 11, 2015 5:37 AM To: Mohammed Guller; user@spark.apache.org Subject: Re: HiveThriftServer2

RE: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-06 Thread Mohammed Guller
Sure, will do. I may not be able to get to it until next week, but will let you know if I am able to crack the code. Mohammed From: Todd Nist [mailto:tsind...@gmail.com] Sent: Friday, April 3, 2015 5:52 PM To: Mohammed Guller Cc: pawan kumar; user@spark.apache.org Subject: Re: Tableau

RE: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Mohammed Guller
: Todd Nist [mailto:tsind...@gmail.com] Sent: Friday, April 3, 2015 11:39 AM To: pawan kumar Cc: Mohammed Guller; user@spark.apache.org Subject: Re: Tableau + Spark SQL Thrift Server + Cassandra Hi Mohammed, Not sure if you have tried this or not. You could try using the below api to start

HiveThriftServer2

2015-04-07 Thread Mohammed Guller
Hi - I want to create an instance of HiveThriftServer2 in my Scala application, so I imported the following line: import org.apache.spark.sql.hive.thriftserver._ However, when I compile the code, I get the following error: object thriftserver is not a member of package

RE: Advice using Spark SQL and Thrift JDBC Server

2015-04-08 Thread Mohammed Guller
+1 Interestingly, I ran into the exactly the same issue yesterday. I couldn’t find any documentation about which project to include as a dependency in build.sbt to use HiveThriftServer2. Would appreciate help. Mohammed From: Todd Nist [mailto:tsind...@gmail.com] Sent: Wednesday, April 8,

RE: Advice using Spark SQL and Thrift JDBC Server

2015-04-08 Thread Mohammed Guller
...@databricks.com] Sent: Wednesday, April 8, 2015 11:54 AM To: Mohammed Guller Cc: Todd Nist; James Aley; user; Patrick Wendell Subject: Re: Advice using Spark SQL and Thrift JDBC Server Sorry guys. I didn't realize that https://issues.apache.org/jira/browse/SPARK-4925 was not fixed yet. You can publish

RE: Advice using Spark SQL and Thrift JDBC Server

2015-04-08 Thread Mohammed Guller
: Wednesday, April 8, 2015 6:16 PM To: Todd Nist Cc: Mohammed Guller; Michael Armbrust; James Aley; user Subject: Re: Advice using Spark SQL and Thrift JDBC Server Hey Guys, Someone submitted a patch for this just now. It's a very simple fix and we can merge it soon. However, it's just missed our

RE: Make HTTP requests from within Spark

2015-06-03 Thread Mohammed Guller
The short answer is yes. How you do it depends on a number of factors. Assuming you want to build an RDD from the responses and then analyze the responses using Spark core (not Spark Streaming), here is one simple way to do it: 1) Implement a class or function that connects to a web service and
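
A sketch along those lines; the endpoint is hypothetical, and scala.io.Source is used for brevity where a real client would add timeouts and error handling:

    import scala.io.Source

    val ids = sc.parallelize(1 to 1000, 8)
    val responses = ids.map { id =>
      Source.fromURL("http://api.example.com/items/" + id).mkString  // one HTTP call per element
    }
    responses.cache()
    println(responses.count())  // then parse and analyze the responses with Spark core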

RE: Anybody using Spark SQL JDBC server with DSE Cassandra?

2015-06-04 Thread Mohammed Guller
I am considering DSE, which has integrated Spark SQL Thrift/JDBC server with Cassandra. Mohammed From: Deenar Toraskar [mailto:deenar.toras...@gmail.com] Sent: Thursday, June 4, 2015 7:42 AM To: Mohammed Guller Cc: user@spark.apache.org Subject: Re: Anybody using Spark SQL JDBC server with DSE

RE: Cassandra Submit

2015-06-05 Thread Mohammed Guller
Check your spark.cassandra.connection.host setting. It should be pointing to one of your Cassandra nodes. Mohammed From: Yasemin Kaya [mailto:godo...@gmail.com] Sent: Friday, June 5, 2015 7:31 AM To: user@spark.apache.org Subject: Cassandra Submit Hi, I am using cassandraDB in my project. I
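
A minimal sketch of where that property is usually set (the node address is hypothetical); it can equally be passed on spark-submit with --conf:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cassandra-app")
      .set("spark.cassandra.connection.host", "192.168.1.10")  // one of the Cassandra nodes
    val sc = new SparkContext(conf)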

RE: Cassandra Submit

2015-06-09 Thread Mohammed Guller
-cassandra-2.1.5$ bin/cassandra-cli -h 127.0.0.1 -p 9160 Mohammed From: Yasemin Kaya [mailto:godo...@gmail.com] Sent: Tuesday, June 9, 2015 11:32 AM To: Yana Kadiyska Cc: Gerard Maas; Mohammed Guller; user@spark.apache.org Subject: Re: Cassandra Submit I removed core and streaming jar

RE: Cassandra Submit

2015-06-09 Thread Mohammed Guller
jar has the wrong version of the library that SCC is trying to use. Welcome to jar hell! Mohammed From: Yasemin Kaya [mailto:godo...@gmail.com] Sent: Tuesday, June 9, 2015 12:24 PM To: Mohammed Guller Cc: Yana Kadiyska; Gerard Maas; user@spark.apache.org Subject: Re: Cassandra Submit My code

RE: Code review - Spark SQL command-line client for Cassandra

2015-06-22 Thread Mohammed Guller
I haven’t tried using Zeppelin with Spark on Cassandra, so can’t say for sure, but it should not be difficult. Mohammed From: Matthew Johnson [mailto:matt.john...@algomi.com] Sent: Monday, June 22, 2015 2:15 AM To: Mohammed Guller; shahid ashraf Cc: user@spark.apache.org Subject: RE: Code

RE: Code review - Spark SQL command-line client for Cassandra

2015-06-19 Thread Mohammed Guller
Hi Matthew, It looks fine to me. I have built a similar service that allows a user to submit a query from a browser and returns the result in JSON format. Another alternative is to leave a Spark shell or one of the notebooks (Spark Notebook, Zeppelin, etc.) session open and run queries from

RE: Code review - Spark SQL command-line client for Cassandra

2015-06-20 Thread Mohammed Guller
:52 AM To: Mohammed Guller Cc: Matthew Johnson; user@spark.apache.org Subject: RE: Code review - Spark SQL command-line client for Cassandra Hi Mohammad Can you provide more info about the Service u developed On Jun 20, 2015 7:59 AM, Mohammed Guller moham...@glassbeam.commailto:moham

RE: Need some Cassandra integration help

2015-06-01 Thread Mohammed Guller
Hi Yana, Not sure whether you already solved this issue. As far as I know, the DataFrame support in Spark Cassandra connector was added in version 1.3. The first milestone release of SCC v1.3 was just announced. Mohammed From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com] Sent: Tuesday, May

RE: Anybody using Spark SQL JDBC server with DSE Cassandra?

2015-06-01 Thread Mohammed Guller
Nobody using Spark SQL JDBC/Thrift server with DSE Cassandra? Mohammed From: Mohammed Guller [mailto:moham...@glassbeam.com] Sent: Friday, May 29, 2015 11:49 AM To: user@spark.apache.org Subject: Anybody using Spark SQL JDBC server with DSE Cassandra? Hi - We have successfully integrated Spark

RE: Migrate Relational to Distributed

2015-06-01 Thread Mohammed Guller
Brant, You should be able to migrate most of your existing SQL code to Spark SQL, but remember that Spark SQL does not yet support the full ANSI standard. So you may need to rewrite some of your existing queries. Another thing to keep in mind is that Spark SQL is not real-time. The response

Anybody using Spark SQL JDBC server with DSE Cassandra?

2015-05-29 Thread Mohammed Guller
Hi - We have successfully integrated Spark SQL with Cassandra. We have a backend that provides a REST API that allows users to execute SQL queries on data in C*. Now we would like to also support JDBC/ODBC connectivity, so that users can use tools like Tableau to query data in C* through the

RE: making dataframe for different types using spark-csv

2015-07-01 Thread Mohammed Guller
Another option is to provide the schema to the load method. One variant of the sqlContext.load takes a schema as a input parameter. You can define the schema programmatically as shown here: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
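
A minimal sketch of passing a programmatic schema to the load method with the spark-csv source; the column names and file path are hypothetical:

    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = true),
      StructField("price", DoubleType, nullable = true)))

    val df = sqlContext.load(
      "com.databricks.spark.csv",
      schema,
      Map("path" -> "data/products.csv", "header" -> "true"))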

RE: How to create a LabeledPoint RDD from a Data Frame

2015-07-06 Thread Mohammed Guller
Have you looked at the new Spark ML library? You can use a DataFrame directly with the Spark ML API. https://spark.apache.org/docs/latest/ml-guide.html Mohammed From: Sourav Mazumder [mailto:sourav.mazumde...@gmail.com] Sent: Monday, July 6, 2015 10:29 AM To: user Subject: How to create a

RE: How do we control output part files created by Spark job?

2015-07-06 Thread Mohammed Guller
You could repartition the dataframe before saving it. However, that would impact the parallelism of the next job that reads these files from HDFS. Mohammed -Original Message- From: kachau [mailto:umesh.ka...@gmail.com] Sent: Monday, July 6, 2015 10:23 AM To: user@spark.apache.org

RE: Spark SQL queries hive table, real time ?

2015-07-06 Thread Mohammed Guller
Hi Florian, It depends on a number of factors. How much data are you querying? Where is the data stored (HDD, SSD or DRAM)? What is the file format (Parquet or CSV)? In theory, it is possible to use Spark SQL for real-time queries, but cost increases as the data size grows. If you can store all

RE: Spark application with a RESTful API

2015-07-06 Thread Mohammed Guller
It is not a bad idea. Many people use this approach. Mohammed -Original Message- From: Sagi r [mailto:stsa...@gmail.com] Sent: Monday, July 6, 2015 1:58 PM To: user@spark.apache.org Subject: Spark application with a RESTful API Hi, I've been researching spark for a couple of months

RE: Heatmap with Spark Streaming

2015-07-30 Thread Mohammed Guller
Umesh, You can create a web-service in any of the languages supported by Spark and stream the result from this web-service to your D3-based client using Websocket or Server-Sent Events. For example, you can create a webservice using Play. This app will integrate with Spark streaming in the

RE: Combining Spark Files with saveAsTextFile

2015-08-04 Thread Mohammed Guller
One options is to use the coalesce method in the RDD class. Mohammed From: Brandon White [mailto:bwwintheho...@gmail.com] Sent: Tuesday, August 4, 2015 7:23 PM To: user Subject: Combining Spark Files with saveAsTextFile What is the best way to make saveAsTextFile save as only a single file?

RE: Combining Spark Files with saveAsTextFile

2015-08-04 Thread Mohammed Guller
Just to further clarify, you can first call coalesce with argument 1 and then call saveAsTextFile. For example, rdd.coalesce(1).saveAsTextFile(...) Mohammed From: Mohammed Guller Sent: Tuesday, August 4, 2015 9:39 PM To: 'Brandon White'; user Subject: RE: Combining Spark Files

Spark SQL unable to recognize schema name

2015-08-04 Thread Mohammed Guller
Hi - I am running the Thrift JDBC/ODBC server (v1.4.1) and encountered a problem when querying tables using fully qualified table names (schemaName.tableName). The following query works fine from the beeline tool: SELECT * from test; However, the following query throws an exception, even

RE: Need help in SparkSQL

2015-07-22 Thread Mohammed Guller
Parquet Mohammed From: Jeetendra Gangele [mailto:gangele...@gmail.com] Sent: Wednesday, July 22, 2015 5:48 AM To: user Subject: Need help in SparkSQL HI All, I have data in MongoDb(few TBs) which I want to migrate to HDFS to do complex queries analysis on this data.Queries like AND queries

RE: Any beginner samples for using ML / MLIB to produce a moving average of a (K, iterable[V])

2015-07-15 Thread Mohammed Guller
I could be wrong, but it looks like the only implementation available right now is MultivariateOnlineSummarizer. Mohammed From: Nkechi Achara [mailto:nkach...@googlemail.com] Sent: Wednesday, July 15, 2015 4:31 AM To: user@spark.apache.org Subject: Any beginner samples for using ML / MLIB to

RE: Feature Generation On Spark

2015-07-18 Thread Mohammed Guller
[mailto:rishikeshtha...@hotmail.com] Sent: Friday, July 17, 2015 12:33 AM To: Mohammed Guller Subject: Re: Feature Generation On Spark Thanks I did look at the example. I am using Spark 1.2. The modules mentioned there are not in 1.2 I guess. The import is failing Rishi From

RE: Spark performance

2015-07-13 Thread Mohammed Guller
. Mohammed From: Michael Segel [mailto:msegel_had...@hotmail.com] Sent: Sunday, July 12, 2015 6:59 AM To: Mohammed Guller Cc: David Mitchell; Roman Sokolov; user; Ravisankar Mani Subject: Re: Spark performance Not necessarily. It depends on the use case and what you intend to do with the data. 4-6

RE: Kmeans Labeled Point RDD

2015-07-20 Thread Mohammed Guller
I responded to your question on SO. Let me know if this what you wanted. http://stackoverflow.com/a/31528274/2336943 Mohammed -Original Message- From: plazaster [mailto:michaelplaz...@gmail.com] Sent: Sunday, July 19, 2015 11:38 PM To: user@spark.apache.org Subject: Re: Kmeans

RE: Data frames select and where clause dependency

2015-07-20 Thread Mohammed Guller
Michael, How would the Catalyst optimizer optimize this version? df.filter(df("filter_field") === value).select("field1").show() Would it still read all the columns in df or would it read only “filter_field” and “field1” since only two columns are used (assuming other columns from df are not used

RE: Data frames select and where clause dependency

2015-07-20 Thread Mohammed Guller
Thanks, Harish. Mike – this would be a cleaner version for your use case: df.filter(df("filter_field") === value).select("field1").show() Mohammed From: Harish Butani [mailto:rhbutani.sp...@gmail.com] Sent: Monday, July 20, 2015 5:37 PM To: Mohammed Guller Cc: Michael Armbrust; Mike Trienis; user

RE: Cassandra via SparkSQL/Hive JDBC

2015-11-11 Thread Mohammed Guller
Short answer: yes. The Spark Cassandra Connector supports the data source API. So you can create a DataFrame that points directly to a Cassandra table. You can query it using the DataFrame API or the SQL/HiveQL interface. If you want to see an example, see slide# 27 and 28 in this deck that I
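
A minimal sketch of that data source API usage; the keyspace, table, and column names are hypothetical:

    val df = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "test", "table" -> "users"))
      .load()

    df.filter(df("age") > 30).select("name", "age").show()   // DataFrame API

    df.registerTempTable("users")                             // or the SQL interface
    sqlContext.sql("SELECT name FROM users WHERE age > 30").show()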

RE: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Mohammed Guller
Did you mean Hive or Spark SQL JDBC/ODBC server? Mohammed From: Bryan Jeffrey [mailto:bryan.jeff...@gmail.com] Sent: Thursday, November 12, 2015 9:12 AM To: Mohammed Guller Cc: user Subject: Re: Cassandra via SparkSQL/Hive JDBC Mohammed, That is great. It looks like a perfect scenario. Would

RE: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Mohammed Guller
to manually SET it for each Beeline session. Mohammed From: Bryan Jeffrey [mailto:bryan.jeff...@gmail.com] Sent: Thursday, November 12, 2015 10:26 AM To: Mohammed Guller Cc: user Subject: Re: Cassandra via SparkSQL/Hive JDBC Answer: In beeline run the following: SET spark.cassandra.connection.host

RE: Spark SQL Thriftserver and Hive UDF in Production

2015-10-18 Thread Mohammed Guller
Have you tried registering the function using the Beeline client? Another alternative would be to create a Spark SQL UDF and launch the Spark SQL Thrift server programmatically. Mohammed -Original Message- From: ReeceRobinson [mailto:re...@therobinsons.gen.nz] Sent: Sunday, October
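
A minimal sketch of the second alternative (the UDF name and logic are hypothetical): register the function on a HiveContext, then hand that context to the Thrift server so JDBC clients can call the UDF.

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val hiveContext = new HiveContext(sc)
    hiveContext.udf.register("to_upper", (s: String) => s.toUpperCase)
    HiveThriftServer2.startWithContext(hiveContext)  // beeline/JDBC sessions can now use to_upper()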

RE: dataframes and numPartitions

2015-10-15 Thread Mohammed Guller
You may find the spark.sql.shuffle.partitions property useful. The default value is 200. Mohammed From: Alex Nastetsky [mailto:alex.nastet...@vervemobile.com] Sent: Wednesday, October 14, 2015 8:14 PM To: user Subject: dataframes and numPartitions A lot of RDD methods take a numPartitions
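
A minimal sketch of adjusting it for a session (400 is just an example value):

    sqlContext.setConf("spark.sql.shuffle.partitions", "400")  // default is 200
    // Subsequent DataFrame joins/aggregations produce 400 post-shuffle partitions.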

RE: laziness in textFile reading from HDFS?

2015-10-06 Thread Mohammed Guller
operation and then a save operation, I don't see how caching would help. Mohammed -Original Message- From: Matt Narrell [mailto:matt.narr...@gmail.com] Sent: Tuesday, October 6, 2015 3:32 PM To: Mohammed Guller Cc: davidkl; user@spark.apache.org Subject: Re: laziness in textFile

RE: laziness in textFile reading from HDFS?

2015-10-06 Thread Mohammed Guller
-hadoop-throws-exception-for-large-lzo-files Mohammed -Original Message- From: Matt Narrell [mailto:matt.narr...@gmail.com] Sent: Tuesday, October 6, 2015 4:08 PM To: Mohammed Guller Cc: davidkl; user@spark.apache.org Subject: Re: laziness in textFile reading from HDFS? Agreed. This is spark

RE: Spark performance

2015-07-10 Thread Mohammed Guller
Hi Ravi, First, neither Spark nor Spark SQL is a database. Both are compute engines, which need to be paired with a storage system. Second, they are designed for processing large distributed datasets. If you have only 100,000 records or even a million records, you don’t need Spark. An RDBMS

RE: Spark performance

2015-07-11 Thread Mohammed Guller
To: Roman Sokolov Cc: Mohammed Guller; user; Ravisankar Mani Subject: Re: Spark performance You can certainly query over 4 TB of data with Spark. However, you will get an answer in minutes or hours, not in milliseconds or seconds. OLTP databases are used for web applications, and typically return

RE: Feature Generation On Spark

2015-07-09 Thread Mohammed Guller
Take a look at the examples here: https://spark.apache.org/docs/latest/ml-guide.html Mohammed From: rishikesh thakur [mailto:rishikeshtha...@hotmail.com] Sent: Saturday, July 4, 2015 10:49 PM To: ayan guha; Michal Čizmazia Cc: user Subject: RE: Feature Generation On Spark I have one document

RE: Spark thrift service and Hive impersonation.

2015-09-29 Thread Mohammed Guller
Jagat Singh [mailto:jagatsi...@gmail.com] Sent: Tuesday, September 29, 2015 6:32 PM To: Mohammed Guller Cc: SparkUser Subject: Re: Spark thrift service and Hive impersonation. Hi, Thanks for your reply. If you see the log message Error: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to

RE: laziness in textFile reading from HDFS?

2015-09-29 Thread Mohammed Guller
1) It is not required to have the same amount of memory as data. 2) By default, the # of partitions is equal to the number of HDFS blocks. 3) Yes, the read operation is lazy. 4) It is fine to have more partitions than cores. Mohammed -Original Message- From: davidkl

RE: Spark thrift service and Hive impersonation.

2015-09-29 Thread Mohammed Guller
Does each user need to start their own thrift server to use it? No. One of the benefits of the Spark Thrift Server is that it allows multiple users to share a single SparkContext. Most likely, you have a file permissions issue. Mohammed From: Jagat Singh [mailto:jagatsi...@gmail.com] Sent: Tuesday,

RE: laziness in textFile reading from HDFS?

2015-10-05 Thread Mohammed Guller
Is there any specific reason for caching the RDD? How many passes you make over the dataset? Mohammed -Original Message- From: Matt Narrell [mailto:matt.narr...@gmail.com] Sent: Saturday, October 3, 2015 9:50 PM To: Mohammed Guller Cc: davidkl; user@spark.apache.org Subject: Re

  1   2   >