I am running Spark 1.0 on a 4-node standalone Spark cluster (1 master + 3
workers). Our app is fetching data from Cassandra and doing a basic filter, map,
and countByKey on that data. I have run into a strange problem. Even if the
number of rows in Cassandra is just 1M, the Spark job seems
with the DataStax Spark driver?
Mohammed
-Original Message-
From: Martin Gammelsæter [mailto:martingammelsae...@gmail.com]
Sent: Friday, July 4, 2014 12:43 AM
To: user@spark.apache.org
Subject: Re: How to use groupByKey and CqlPagingInputFormat
On Thu, Jul 3, 2014 at 10:29 PM, Mohammed Guller
Hi -
When I run the following Spark SQL query in spark-shell (version 1.1.0):
val rdd = sqlContext.sql("SELECT a FROM x WHERE ts >= '2012-01-01T00:00:00' AND
ts <= '2012-03-31T23:59:59'")
it gives the following error:
rdd: org.apache.spark.sql.SchemaRDD =
SchemaRDD[294] at RDD at
, 2014 4:37 AM
To: Mohammed Guller; user@spark.apache.org
Subject: Re: Spark SQL parser bug?
Hi Mohammed,
Would you mind sharing the DDL of the table x and the complete stack trace of
the exception you got? A full Spark shell session history would be more than
helpful. PR #2084 had been merged
[a#0,ts#1], MapPartitionsRDD[37] at mapPartitions at
basicOperators.scala:208
scala> sRdd.collect
res10: Array[org.apache.spark.sql.Row] = Array()
Mohammed
From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: Friday, October 10, 2014 10:14 PM
To: Mohammed Guller; user@spark.apache.org
Subject
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Sunday, October 12, 2014 1:35 AM
To: Mohammed Guller; Cheng Lian; user@spark.apache.org
Subject: RE: Spark SQL parser bug?
Hi, I couldn’t reproduce the bug with the latest master branch. Which version
are you using? Can you also list data
Plan ==
Project [a#2]
ExistingRdd [a#2,ts#3], MapPartitionsRDD[22] at mapPartitions at
basicOperators.scala:208
scala> s.collect
res5: Array[org.apache.spark.sql.Row] = Array()
Mohammed
From: Yin Huai [mailto:huaiyin@gmail.com]
Sent: Monday, October 13, 2014 7:19 AM
To: Mohammed Guller
Cc
That explains it. Thanks!
Mohammed
From: Yin Huai [mailto:huaiyin@gmail.com]
Sent: Monday, October 13, 2014 8:47 AM
To: Mohammed Guller
Cc: Cheng, Hao; Cheng Lian; user@spark.apache.org
Subject: Re: Spark SQL parser bug?
Yeah, it is not related to timezone. I think you hit this
issue: https
Hi -
Has anybody figured out how to integrate a Play application with Spark and run
it on a Spark cluster using the spark-submit script? I have seen some blogs about
creating a simple Play app and running it locally on a dev machine with the sbt
run command. However, those steps don't work for
that piece of code? Also is there any
specific reason why you are not using play dist instead?
Mohammed
From: US Office Admin [mailto:ad...@vectorum.com]
Sent: Thursday, October 16, 2014 11:41 AM
To: Surendranauth Hiraman; Mohammed Guller
Cc: Daniel Siegmann; user@spark.apache.org
Subject: Re
To: Mohammed Guller
Cc: US Office Admin; Surendranauth Hiraman; Daniel Siegmann;
user@spark.apache.org
Subject: Re: Play framework
Hi,
Below is the link for a simple Play + SparkSQL example -
http://blog.knoldus.com/2014/07/14/play-with-spark-building-apache-spark-with-play-framework-part-3
What about all the Play dependencies, since the jar created by the 'play
package' command won't include the Play jar or any of the 100+ jars on which
Play itself depends?
Mohammed
From: US Office Admin [mailto:ad...@vectorum.com]
Sent: Thursday, October 16, 2014 7:05 PM
To: Mohammed Guller
To: Mohammed Guller
Cc: US Office Admin; Surendranauth Hiraman; Daniel Siegmann;
user@spark.apache.org
Subject: Re: Play framework
In our case, Play libraries are not required to run Spark jobs. Hence they are
available only on the master, and Play runs as a regular Scala application. I
can't think
Try a version built with Akka 2.2.x
Mohammed
From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
Sent: Tuesday, October 28, 2014 3:03 AM
To: user
Subject: Spray client reports Exception:
akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext
Hi,
I got the following exceptions
Hi -
The Spark SQL Row class has methods such as getInt, getLong, getBoolean,
getFloat, getDouble, etc. However, I don't see a getDate method. So how can one
retrieve a date/timestamp type column from a result set?
Thanks,
Mohammed
:23 PM
To: Zhan Zhang
Cc: Mohammed Guller; user@spark.apache.org
Subject: Re: how to retrieve the value of a column of type date/timestamp from
a Spark SQL Row
Or def getAs[T](i: Int): T
Best Regards,
Shixiong Zhu
2014-10-29 13:16 GMT+08:00 Zhan Zhang
zzh...@hortonworks.com
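For example, a minimal sketch of using getAs to pull a timestamp column out of a
query result (the table name, column positions, and variable names are
illustrative, not from the original thread; it assumes a spark-shell session
where sc already exists):

  import java.sql.Timestamp
  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  // hypothetical table with a string column and a timestamp column
  val result = sqlContext.sql("SELECT name, ts FROM x")
  val firstRow = result.first()
  // retrieve column 1 (the timestamp) as java.sql.Timestamp
  val ts: Timestamp = firstRow.getAs[Timestamp](1)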
I am not sure about that.
Can you try a Spray version built with 2.2.x along with Spark 1.1 and include
the Akka dependencies in your project’s sbt file?
Mohammed
From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
Sent: Tuesday, October 28, 2014 8:58 PM
To: Mohammed Guller
Cc: user
Subject
Actually, it is possible to integrate Spark 1.1.0 with Play 2.2.x
Here is a sample build.sbt file:
name := "xyz"
version := "0.1"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
  jdbc,
  anorm,
  cache,
  "org.apache.spark" %% "spark-core" % "1.1.0",
  "com.typesafe.akka" %% "akka-actor" % "2.2.3",
David,
Here is what I would suggest:
1 - Does a new SparkContext get created in the web tier for each new request
for processing?
Create a single SparkContext that gets shared across multiple web requests.
Depending on the framework that you are using for the web-tier, it should not
be
Meehan [mailto:jnmee...@gmail.com]
Sent: Tuesday, November 11, 2014 11:35 PM
To: Mohammed Guller
Cc: Patrick Wendell; Akshat Aranya; user@spark.apache.org
Subject: Re: Spark and Play
You can also build a Play 2.2.x + Spark 1.1.0 fat jar with sbt-assembly for,
e.g. yarn-client support or using
Hi - I was curious if anyone is using the Spark SQL Thrift JDBC server with
Cassandra. It would be great if you could share how you got it working. For
example, what config changes have to be done in hive-site.xml, what additional
jars are required, etc.?
I have a Spark app that can
Hi Jerome,
This is cool. It would be great if you could share more details about how you got
your setup working. For example, what additional libraries/jars you are
using. How are you configuring the ThriftServer to use the additional jars to
communicate with Cassandra?
In addition, how
Thanks, Jerome.
BTW, have you tried the CalliopeServer2 from tuplejump? I was able to quickly
connect from beeline/Squirrel to my Cassandra cluster using CalliopeServer2,
which extends Spark SQL Thrift Server. It was very straight forward.
Next step is to connect from Tableau, but I can't find
Leon,
I solved the problem by creating a workaround for it, so I didn't need to
upgrade to 1.1.2-SNAPSHOT.
Mohammed
-Original Message-
From: Leon [mailto:pachku...@gmail.com]
Sent: Tuesday, November 25, 2014 11:36 AM
To: u...@spark.incubator.apache.org
Subject: RE: Spark SQL
Two options that I can think of:
1) Use the Spark SQL Thrift/JDBC server.
2) Develop a web app using some framework such as Play and expose a set of
REST APIs for sending queries. Inside your web app backend, you initialize the
Spark SQL context only once when your app initializes.
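As a minimal sketch of option 2 (the object name, master URL, and the
query-to-string conversion are placeholders, not a definitive implementation):
create the context once in a singleton and have every REST handler reuse it:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  object SparkBackend {
    private val conf = new SparkConf()
      .setAppName("sql-rest-backend")
      .setMaster("spark://master:7077")   // assumed standalone master URL
    lazy val sc = new SparkContext(conf)
    lazy val sqlContext = new SQLContext(sc)

    // called by the web framework's request handlers
    def runQuery(query: String): Array[String] =
      sqlContext.sql(query).collect().map(_.mkString(","))
  }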
AM
To: Mohammed Guller; u...@spark.incubator.apache.org
Subject: Re: querying data from Cassandra through the Spark SQL Thrift JDBC
server
This thread might be helpful
http://apache-spark-user-list.1001560.n3.nabble.com/tableau-spark-sql-cassandra-tp19282.html
On 11/20/14 4:11 AM, Mohammed Guller
Jamal,
I have not tried this, but can you not integrate Spark SQL with your Spring
Java web app just like a standalone app? I have integrated a Scala web app
(using Play) with Spark SQL and it works.
Mohammed
From: adrian [mailto:adria...@gmail.com]
Sent: Friday, November 28, 2014 11:03 AM
To:
Hi -
I understand that one can use spark.deploy.defaultCores and spark.cores.max
to assign a fixed number of worker cores to different apps. However, instead of
statically assigning the cores, I would like Spark to dynamically assign the
cores to multiple apps. For example, when there is a
Hi -
Does anybody have any ideas how to dynamically allocate cores instead of
statically partitioning them among multiple applications? Thanks.
Mohammed
From: Mohammed Guller
Sent: Friday, December 5, 2014 11:26 PM
To: user@spark.apache.org
Subject: Fair scheduling across applications
Option 1:
dataRDD.filter(x => (x._2 == "apple") || (x._2 == "orange"))
Option 2:
val fruits = Set("apple", "orange", "pear")
dataRDD.filter(x => fruits.contains(x._2))
Mohammed
-Original Message-
From: dizzy5112 [mailto:dave.zee...@gmail.com]
Sent: Tuesday, December 9, 2014 2:16 PM
To:
Nitin,
Suing Spark is not going to help. Perhaps you should sue someone else :-) Just
kidding!
Mohammed
-Original Message-
From: nitinkak001 [mailto:nitinkak...@gmail.com]
Sent: Tuesday, February 3, 2015 1:57 PM
To: user@spark.apache.org
Subject: Re: Sort based shuffle not working
I don’t think it will work without HDFS.
Mohammed
From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com]
Sent: Tuesday, January 20, 2015 7:55 AM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: RE: Can I save RDD to local file system and then read it back on spark
-applications.html
Hope this help!
Kelvin
On Thu, Feb 19, 2015 at 7:24 PM, Mohammed Guller
moham...@glassbeam.com wrote:
Hi –
I am trying to use BoneCP (a database connection pooling library) to write data
from my Spark application to an RDBMS. The database inserts
Where did you look?
BTW, it is defined in the RDD class as a val:
val partitioner: Option[Partitioner]
Mohammed
-Original Message-
From: Darin McBeath [mailto:ddmcbe...@yahoo.com.INVALID]
Sent: Tuesday, February 17, 2015 1:45 PM
To: User
Subject: How do you get the partitioner for
Looks like the culprit is this error:
FileNotFoundException: File
file:/home/sparkuser/spark-1.2.0/spark-1.2.0-bin-hadoop2.4/data/cut/ratings.txt
does not exist
Mohammed
-Original Message-
From: riginos [mailto:samarasrigi...@gmail.com]
Sent: Tuesday, January 27, 2015 4:24 PM
To:
Hi –
I am trying to use BoneCP (a database connection pooling library) to write data
from my Spark application to an RDBMS. The database inserts are inside a
foreachPartition code block. I am getting this exception when the code tries to
insert data using BoneCP:
java.sql.SQLException: No
in it, and it's really on your classpath.
On Fri, Feb 20, 2015 at 5:27 AM, Mohammed Guller moham...@glassbeam.com wrote:
Hi Kelvin,
Yes. I am creating an uber jar with the Postgres driver included, but
nevertheless tried both --jars and --driver-class-path flags. It didn't help.
Interestingly, I
SPARK_WORKER_MEMORY=8g
Will allocate 8GB memory to Spark on each worker node. Nothing to do with # of
executors.
Mohammed
From: Yiannis Gkoufas [mailto:johngou...@gmail.com]
Sent: Friday, February 20, 2015 4:55 AM
To: user@spark.apache.org
Subject: Setting the number of executors in standalone
AFAIK, in standalone mode, each Spark application gets one executor on each
worker. You could run multiple workers on a machine though.
Mohammed
From: Yiannis Gkoufas [mailto:johngou...@gmail.com]
Sent: Friday, February 20, 2015 9:48 AM
To: Mohammed Guller
Cc: user@spark.apache.org
Subject
To: Mohammed Guller
Cc: Kelvin Chu; user@spark.apache.org
Subject: Re: using a database connection pool to write data into an RDBMS from
a Spark application
Hm, others can correct me if I'm wrong, but is this what SPARK_CLASSPATH is for?
On Fri, Feb 20, 2015 at 6:04 PM, Mohammed Guller moham
...@cloudera.com]
Sent: Friday, February 20, 2015 9:42 AM
To: Mohammed Guller
Cc: Kelvin Chu; user@spark.apache.org
Subject: Re: using a database connection pool to write data into an RDBMS from
a Spark application
Have a look at spark.yarn.user.classpath.first and
spark.files.userClassPathFirst
Looks like the application is using a lot more memory than available. Could be
a bug somewhere in the code or just an underpowered machine. Hard to say without
looking at the code.
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
Mohammed
-Original Message-
From:
Another solution would be to use the reduce action.
Mohammed
From: Ganelin, Ilya [mailto:ilya.gane...@capitalone.com]
Sent: Thursday, January 29, 2015 1:32 PM
To: 'derrickburns'; 'user@spark.apache.org'
Subject: RE: spark challenge: zip with next???
Make a copy of your RDD with an extra entry
How much memory are you assigning to the Spark executor on the worker node?
Mohammed
From: ey-chih chow [mailto:eyc...@hotmail.com]
Sent: Thursday, January 29, 2015 3:35 PM
To: Mohammed Guller; user@spark.apache.org
Subject: RE: unknown issue in submitting a spark job
The worker node has 15G
Hint:
DF.rdd.map{}
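For instance, a minimal sketch of that hint (the column positions and the
flattening into (month, value) pairs are assumptions about what you want, not
from your post):

  // each Row looks like [2015-04, ArrayBuffer(A, B, C, D)]
  val flattened = df.rdd.flatMap { row =>
    val month  = row.getString(0)
    val values = row.getAs[Seq[String]](1)   // the ArrayBuffer column
    values.map(v => (month, v))              // ("2015-04", "A"), ("2015-04", "B"), ...
  }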
Mohammed
From: Denny Lee [mailto:denny.g@gmail.com]
Sent: Thursday, April 2, 2015 7:10 PM
To: user@spark.apache.org
Subject: ArrayBuffer within a DataFrame
Quick question - the output of a dataframe is in the format of:
[2015-04, ArrayBuffer(A, B, C, D)]
and I'd
: user@spark.apache.org; Mohammed Guller
Subject: Re: Tableau + Spark SQL Thrift Server + Cassandra
Hi Todd,
Thanks for the link. I would be interested in this solution. I am using DSE for
Cassandra. Would you provide me with info on connecting with DSE either through
Tableau or Zeppelin
One reason Spark on disk is faster than MapReduce is Spark’s advanced Directed
Acyclic Graph (DAG) engine. MapReduce will require a complex job to be split
into multiple Map-Reduce jobs, with disk I/O at the end of each job and the
beginning of the next. With Spark, you may be able to express the
No, you don’t need to do anything special. Perhaps, your application is getting
stuck somewhere? If you can share your code, someone may be able to help.
Mohammed
From: James Carman [mailto:ja...@carmanconsulting.com]
Sent: Friday, May 1, 2015 5:53 AM
To: user@spark.apache.org
Subject: Exiting
Did you confirm through the Spark UI how much memory is getting allocated to
your application on each worker?
Mohammed
From: Vijayasarathy Kannan [mailto:kvi...@vt.edu]
Sent: Monday, May 4, 2015 3:36 PM
To: Andrew Ash
Cc: user@spark.apache.org
Subject: Re: Spark JVM default memory
I am trying
Why are you not using Cassandra for storing the pre-computed views?
Mohammed
-Original Message-
From: rafac [mailto:rafaelme...@hotmail.com]
Sent: Friday, May 8, 2015 1:48 PM
To: user@spark.apache.org
Subject: Lambda architecture using Apache Spark
I am implementing the lambda
If I understand you correctly, you need Window duration of 1 hour and sliding
interval of 5 seconds.
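For example, a minimal sketch (the socket source, hostname, and port are
placeholders; it assumes an existing SparkContext sc and a 5-second batch
interval):

  import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(5))
  val events = ssc.socketTextStream("localhost", 9999)   // placeholder source
  // one-hour window, recomputed every 5 seconds
  val lastHour = events.window(Minutes(60), Seconds(5))
  lastHour.count().print()
  ssc.start()
  ssc.awaitTermination()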
Mohammed
-Original Message-
From: Ankur Chauhan [mailto:achau...@brightcove.com]
Sent: Friday, May 8, 2015 2:27 PM
To: u...@spark.incubator.apache.org
Subject: Spark streaming
Thanks, Cheng.
BTW, there is another thread on the same topic. It looks like the thrift-server
will be published for 1.3.1.
Mohammed
From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: Saturday, April 11, 2015 5:37 AM
To: Mohammed Guller; user@spark.apache.org
Subject: Re: HiveThriftServer2
Sure, will do. I may not be able to get to it until next week, but will let you
know if I am able to crack the code.
Mohammed
From: Todd Nist [mailto:tsind...@gmail.com]
Sent: Friday, April 3, 2015 5:52 PM
To: Mohammed Guller
Cc: pawan kumar; user@spark.apache.org
Subject: Re: Tableau
: Todd Nist [mailto:tsind...@gmail.com]
Sent: Friday, April 3, 2015 11:39 AM
To: pawan kumar
Cc: Mohammed Guller; user@spark.apache.org
Subject: Re: Tableau + Spark SQL Thrift Server + Cassandra
Hi Mohammed,
Not sure if you have tried this or not. You could try using the below api to
start
Hi -
I want to create an instance of HiveThriftServer2 in my Scala application, so
I imported the following line:
import org.apache.spark.sql.hive.thriftserver._
However, when I compile the code, I get the following error:
object thriftserver is not a member of package
+1
Interestingly, I ran into exactly the same issue yesterday. I couldn't
find any documentation about which project to include as a dependency in
build.sbt to use HiveThriftServer2. Would appreciate help.
Mohammed
From: Todd Nist [mailto:tsind...@gmail.com]
Sent: Wednesday, April 8,
...@databricks.com]
Sent: Wednesday, April 8, 2015 11:54 AM
To: Mohammed Guller
Cc: Todd Nist; James Aley; user; Patrick Wendell
Subject: Re: Advice using Spark SQL and Thrift JDBC Server
Sorry guys. I didn't realize that
https://issues.apache.org/jira/browse/SPARK-4925 was not fixed yet.
You can publish
: Wednesday, April 8, 2015 6:16 PM
To: Todd Nist
Cc: Mohammed Guller; Michael Armbrust; James Aley; user
Subject: Re: Advice using Spark SQL and Thrift JDBC Server
Hey Guys,
Someone submitted a patch for this just now. It's a very simple fix and we can
merge it soon. However, it's just missed our
The short answer is yes.
How you do it depends on a number of factors. Assuming you want to build an RDD
from the responses and then analyze the responses using Spark core (not Spark
Streaming), here is one simple way to do it:
1) Implement a class or function that connects to a web service and
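A minimal sketch of that approach (the endpoint URL and the line-per-record
format are placeholders; it fetches on the driver and then parallelizes):

  import scala.io.Source
  import org.apache.spark.{SparkConf, SparkContext}

  object WebServiceRdd {
    // step 1: connect to the web service and collect its response as records
    def fetch(url: String): Seq[String] =
      Source.fromURL(url).getLines().toSeq

    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("webservice-rdd"))
      val records = fetch("http://example.com/api/data")   // hypothetical endpoint
      val rdd = sc.parallelize(records)                     // step 2: build the RDD
      println(rdd.count())                                  // step 3: analyze with Spark core
      sc.stop()
    }
  }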
I am considering DSE, which has integrated Spark SQL
Thrift/JDBC server with Cassandra.
Mohammed
From: Deenar Toraskar [mailto:deenar.toras...@gmail.com]
Sent: Thursday, June 4, 2015 7:42 AM
To: Mohammed Guller
Cc: user@spark.apache.org
Subject: Re: Anybody using Spark SQL JDBC server with DSE
Check your spark.cassandra.connection.host setting. It should be pointing to
one of your Cassandra nodes.
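For example, a minimal sketch of setting it in code (the IP address is a
placeholder for one of your Cassandra nodes; it can equally be passed with
--conf on spark-submit):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("cassandra-app")
    .set("spark.cassandra.connection.host", "10.0.0.5")   // placeholder node address
  val sc = new SparkContext(conf)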
Mohammed
From: Yasemin Kaya [mailto:godo...@gmail.com]
Sent: Friday, June 5, 2015 7:31 AM
To: user@spark.apache.org
Subject: Cassandra Submit
Hi,
I am using cassandraDB in my project. I
-cassandra-2.1.5$ bin/cassandra-cli -h 127.0.0.1 -p 9160
Mohammed
From: Yasemin Kaya [mailto:godo...@gmail.com]
Sent: Tuesday, June 9, 2015 11:32 AM
To: Yana Kadiyska
Cc: Gerard Maas; Mohammed Guller; user@spark.apache.org
Subject: Re: Cassandra Submit
I removed core and streaming jar
jar has the wrong version of the library that
SCC is trying to use. Welcome to jar hell!
Mohammed
From: Yasemin Kaya [mailto:godo...@gmail.com]
Sent: Tuesday, June 9, 2015 12:24 PM
To: Mohammed Guller
Cc: Yana Kadiyska; Gerard Maas; user@spark.apache.org
Subject: Re: Cassandra Submit
My code
I haven’t tried using Zeppelin with Spark on Cassandra, so can’t say for sure,
but it should not be difficult.
Mohammed
From: Matthew Johnson [mailto:matt.john...@algomi.com]
Sent: Monday, June 22, 2015 2:15 AM
To: Mohammed Guller; shahid ashraf
Cc: user@spark.apache.org
Subject: RE: Code
Hi Matthew,
It looks fine to me. I have built a similar service that allows a user to
submit a query from a browser and returns the result in JSON format.
Another alternative is to leave a Spark shell or a notebook (Spark Notebook,
Zeppelin, etc.) session open and run queries from
:52 AM
To: Mohammed Guller
Cc: Matthew Johnson; user@spark.apache.org
Subject: RE: Code review - Spark SQL command-line client for Cassandra
Hi Mohammed,
Can you provide more info about the service you developed
On Jun 20, 2015 7:59 AM, Mohammed Guller
moham...@glassbeam.com
Hi Yana,
Not sure whether you already solved this issue. As far as I know, the DataFrame
support in Spark Cassandra connector was added in version 1.3. The first
milestone release of SCC v1.3 was just announced.
Mohammed
From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com]
Sent: Tuesday, May
Nobody using Spark SQL JDBC/Thrift server with DSE Cassandra?
Mohammed
From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Friday, May 29, 2015 11:49 AM
To: user@spark.apache.org
Subject: Anybody using Spark SQL JDBC server with DSE Cassandra?
Hi -
We have successfully integrated Spark
Brant,
You should be able to migrate most of your existing SQL code to Spark SQL, but
remember that Spark SQL does not yet support the full ANSI standard. So you may
need to rewrite some of your existing queries.
Another thing to keep in mind is that Spark SQL is not real-time. The response
Hi -
We have successfully integrated Spark SQL with Cassandra. We have a backend
that provides a REST API that allows users to execute SQL queries on data in
C*. Now we would like to also support JDBC/ODBC connectivity, so that users can
use tools like Tableau to query data in C* through the
Another option is to provide the schema to the load method. One variant of the
sqlContext.load takes a schema as an input parameter. You can define the schema
programmatically as shown here:
https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
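A minimal sketch, assuming an existing sqlContext and a JSON source (the column
names and the path are illustrative only):

  import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

  val schema = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("name", StringType, nullable = true)
  ))

  // the load variant that takes a data source name, a schema, and options
  val df = sqlContext.load(
    "json",
    schema,
    Map("path" -> "hdfs:///data/people.json")   // hypothetical path
  )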
Have you looked at the new Spark ML library? You can use a DataFrame directly
with the Spark ML API.
https://spark.apache.org/docs/latest/ml-guide.html
Mohammed
From: Sourav Mazumder [mailto:sourav.mazumde...@gmail.com]
Sent: Monday, July 6, 2015 10:29 AM
To: user
Subject: How to create a
You could repartition the dataframe before saving it. However, that would
impact the parallelism of subsequent jobs that read these files from HDFS.
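For example, a minimal sketch (the partition count, format, and output path are
placeholders; the writer API shown assumes Spark 1.4+):

  df.repartition(8)
    .write
    .format("parquet")
    .save("hdfs:///output/my_table")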
Mohammed
-Original Message-
From: kachau [mailto:umesh.ka...@gmail.com]
Sent: Monday, July 6, 2015 10:23 AM
To: user@spark.apache.org
Hi Florian,
It depends on a number of factors. How much data are you querying? Where is the
data stored (HDD, SSD or DRAM)? What is the file format (Parquet or CSV)?
In theory, it is possible to use Spark SQL for real-time queries, but cost
increases as the data size grows. If you can store all
It is not a bad idea. Many people use this approach.
Mohammed
-Original Message-
From: Sagi r [mailto:stsa...@gmail.com]
Sent: Monday, July 6, 2015 1:58 PM
To: user@spark.apache.org
Subject: Spark application with a RESTful API
Hi,
I've been researching spark for a couple of months
Umesh,
You can create a web service in any of the languages supported by Spark and
stream the result from this web service to your D3-based client using WebSocket
or Server-Sent Events.
For example, you can create a web service using Play. This app will integrate
with Spark Streaming in the
One option is to use the coalesce method in the RDD class.
Mohammed
From: Brandon White [mailto:bwwintheho...@gmail.com]
Sent: Tuesday, August 4, 2015 7:23 PM
To: user
Subject: Combining Spark Files with saveAsTextFile
What is the best way to make saveAsTextFile save as only a single file?
Just to further clarify, you can first call coalesce with argument 1 and then
call saveAsTextFile. For example,
rdd.coalesce(1).saveAsTextFile(...)
Mohammed
From: Mohammed Guller
Sent: Tuesday, August 4, 2015 9:39 PM
To: 'Brandon White'; user
Subject: RE: Combining Spark Files
Hi -
I am running the Thrift JDBC/ODBC server (v1.4.1) and encountered a problem
when querying tables using fully qualified table names (schemaName.tableName).
The following query works fine from the beeline tool:
SELECT * from test;
However, the following query throws an exception, even
Parquet
Mohammed
From: Jeetendra Gangele [mailto:gangele...@gmail.com]
Sent: Wednesday, July 22, 2015 5:48 AM
To: user
Subject: Need help in SparkSQL
HI All,
I have data in MongoDB (a few TBs) which I want to migrate to HDFS to do complex
query analysis on this data. Queries like AND queries
I could be wrong, but it looks like the only implementation available right now
is MultivariateOnlineSummarizer.
Mohammed
From: Nkechi Achara [mailto:nkach...@googlemail.com]
Sent: Wednesday, July 15, 2015 4:31 AM
To: user@spark.apache.org
Subject: Any beginner samples for using ML / MLIB to
[mailto:rishikeshtha...@hotmail.com]
Sent: Friday, July 17, 2015 12:33 AM
To: Mohammed Guller
Subject: Re: Feature Generation On Spark
Thanks, I did look at the example. I am using Spark 1.2. The modules mentioned
there are not in 1.2 I guess. The import is failing
Rishi
From
.
Mohammed
From: Michael Segel [mailto:msegel_had...@hotmail.com]
Sent: Sunday, July 12, 2015 6:59 AM
To: Mohammed Guller
Cc: David Mitchell; Roman Sokolov; user; Ravisankar Mani
Subject: Re: Spark performance
Not necessarily.
It depends on the use case and what you intend to do with the data.
4-6
I responded to your question on SO. Let me know if this is what you wanted.
http://stackoverflow.com/a/31528274/2336943
Mohammed
-Original Message-
From: plazaster [mailto:michaelplaz...@gmail.com]
Sent: Sunday, July 19, 2015 11:38 PM
To: user@spark.apache.org
Subject: Re: Kmeans
Michael,
How would the Catalyst optimizer optimize this version?
df.filter(df("filter_field") === value).select("field1").show()
Would it still read all the columns in df or would it read only “filter_field”
and “field1” since only two columns are used (assuming other columns from df
are not used
Thanks, Harish.
Mike – this would be a cleaner version for your use case:
df.filter(df("filter_field") === value).select("field1").show()
Mohammed
From: Harish Butani [mailto:rhbutani.sp...@gmail.com]
Sent: Monday, July 20, 2015 5:37 PM
To: Mohammed Guller
Cc: Michael Armbrust; Mike Trienis; user
Short answer: yes.
The Spark Cassandra Connector supports the data source API. So you can create a
DataFrame that points directly to a Cassandra table. You can query it using the
DataFrame API or the SQL/HiveQL interface.
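For instance, a minimal sketch using the connector's data source API (the
keyspace, table, and column names are placeholders; the read syntax assumes
Spark 1.4+ and a recent connector version):

  val df = sqlContext.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
    .load()

  df.filter(df("user_id") === 42).show()     // DataFrame API
  df.registerTempTable("my_table")
  sqlContext.sql("SELECT count(*) FROM my_table").show()   // SQL interface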
If you want to see an example, see slide# 27 and 28 in this deck that I
Did you mean Hive or Spark SQL JDBC/ODBC server?
Mohammed
From: Bryan Jeffrey [mailto:bryan.jeff...@gmail.com]
Sent: Thursday, November 12, 2015 9:12 AM
To: Mohammed Guller
Cc: user
Subject: Re: Cassandra via SparkSQL/Hive JDBC
Mohammed,
That is great. It looks like a perfect scenario. Would
to manually SET it for each Beeline session.
Mohammed
From: Bryan Jeffrey [mailto:bryan.jeff...@gmail.com]
Sent: Thursday, November 12, 2015 10:26 AM
To: Mohammed Guller
Cc: user
Subject: Re: Cassandra via SparkSQL/Hive JDBC
Answer: In beeline run the following: SET
spark.cassandra.connection.host
Have you tried registering the function using the Beeline client?
Another alternative would be to create a Spark SQL UDF and launch the Spark SQL
Thrift server programmatically.
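A minimal sketch of the second option (the UDF and its name are illustrative; it
assumes an existing SparkContext sc and that the hive-thriftserver module is on
the classpath):

  import org.apache.spark.sql.hive.HiveContext
  import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

  val hiveContext = new HiveContext(sc)
  hiveContext.udf.register("to_upper", (s: String) => s.toUpperCase)   // illustrative UDF
  // JDBC/ODBC clients connecting to this server can now call to_upper()
  HiveThriftServer2.startWithContext(hiveContext)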
Mohammed
-Original Message-
From: ReeceRobinson [mailto:re...@therobinsons.gen.nz]
Sent: Sunday, October
You may find the spark.sql.shuffle.partitions property useful. The default
value is 200.
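For example, a minimal sketch of lowering it for a small dataset (64 is an
arbitrary value; it assumes an existing sqlContext):

  sqlContext.setConf("spark.sql.shuffle.partitions", "64")
  // subsequent joins/aggregations on DataFrames will now produce 64 partitions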
Mohammed
From: Alex Nastetsky [mailto:alex.nastet...@vervemobile.com]
Sent: Wednesday, October 14, 2015 8:14 PM
To: user
Subject: dataframes and numPartitions
A lot of RDD methods take a numPartitions
operation and then a save
operation, I don't see how caching would help.
Mohammed
-Original Message-
From: Matt Narrell [mailto:matt.narr...@gmail.com]
Sent: Tuesday, October 6, 2015 3:32 PM
To: Mohammed Guller
Cc: davidkl; user@spark.apache.org
Subject: Re: laziness in textFile
-hadoop-throws-exception-for-large-lzo-files
Mohammed
-Original Message-
From: Matt Narrell [mailto:matt.narr...@gmail.com]
Sent: Tuesday, October 6, 2015 4:08 PM
To: Mohammed Guller
Cc: davidkl; user@spark.apache.org
Subject: Re: laziness in textFile reading from HDFS?
Agreed. This is spark
Hi Ravi,
First, neither Spark nor Spark SQL is a database. Both are compute engines,
which need to be paired with a storage system. Second, they are designed for
processing large distributed datasets. If you have only 100,000 records or even
a million records, you don't need Spark. An RDBMS
To: Roman Sokolov
Cc: Mohammed Guller; user; Ravisankar Mani
Subject: Re: Spark performance
You can certainly query over 4 TB of data with Spark. However, you will get an
answer in minutes or hours, not in milliseconds or seconds. OLTP databases are
used for web applications, and typically return
Take a look at the examples here:
https://spark.apache.org/docs/latest/ml-guide.html
Mohammed
From: rishikesh thakur [mailto:rishikeshtha...@hotmail.com]
Sent: Saturday, July 4, 2015 10:49 PM
To: ayan guha; Michal Čizmazia
Cc: user
Subject: RE: Feature Generation On Spark
I have one document
Jagat Singh [mailto:jagatsi...@gmail.com]
Sent: Tuesday, September 29, 2015 6:32 PM
To: Mohammed Guller
Cc: SparkUser
Subject: Re: Spark thrift service and Hive impersonation.
Hi,
Thanks for your reply.
If you see the log message
Error: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to
1) It is not required to have the same amount of memory as data.
2) By default, the number of partitions equals the number of HDFS blocks
3) Yes, the read operation is lazy
4) It is okay to have more partitions than cores.
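To make points 2, 3, and 4 concrete, a minimal sketch (the HDFS path and the
minPartitions value are placeholders; it assumes an existing SparkContext sc):

  // nothing is read from HDFS yet; textFile is lazy
  val lines = sc.textFile("hdfs:///data/big.log", minPartitions = 256)
  println(lines.partitions.length)   // at least 256; otherwise roughly one per HDFS block
  println(lines.count())             // the action triggers the actual read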
Mohammed
-Original Message-
From: davidkl
Does each user need to start their own thrift server to use it?
No. One of the benefits of the Spark Thrift Server is that it allows multiple
users to share a single SparkContext.
Most likely, you have a file permissions issue.
Mohammed
From: Jagat Singh [mailto:jagatsi...@gmail.com]
Sent: Tuesday,
Is there any specific reason for caching the RDD? How many passes do you make
over the dataset?
Mohammed
-Original Message-
From: Matt Narrell [mailto:matt.narr...@gmail.com]
Sent: Saturday, October 3, 2015 9:50 PM
To: Mohammed Guller
Cc: davidkl; user@spark.apache.org
Subject: Re