RE: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-12 Thread Mohammed Guller
as they happen, I might lean towards Kafka streaming. Agree about the benefits of using SQL with structured streaming. Mohammed

RE: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread Mohammed Guller
Just to elaborate more on what Vincent wrote – Kafka streaming provides true record-at-a-time processing capabilities whereas Spark Streaming provides micro-batching capabilities on top of Spark. Depending on your use case, you may find one better than the other. Both provide stateless and stateful

RE: Explanation regarding Spark Streaming

2016-08-06 Thread Mohammed Guller
-serialization Mohammed

RE: Explanation regarding Spark Streaming

2016-08-06 Thread Mohammed Guller
performance even worse. Mohammed From: Jacek Laskowski: Hi, thanks for the explanation, but it does not prove Spark will OOM at some point. You

RE: Explanation regarding Spark Streaming

2016-08-05 Thread Mohammed Guller
From: Jacek Laskowski: On Fri, Aug 5, 2016 at 12:48 AM, Mohammed Guller wrote: > and eventually you will run out of memory.

RE: Explanation regarding Spark Streaming

2016-08-04 Thread Mohammed Guller
The backlog will increase as time passes and eventually you will run out of memory. Mohammed

RE: Spark SQL driver memory keeps rising

2016-06-16 Thread Mohammed Guller
From: Khaled Hammouda: I'm using pyspark and running in YARN client mode. I managed to ano

RE: Spark SQL driver memory keeps rising

2016-06-15 Thread Mohammed Guller
It would be hard to guess what could be going on without looking at the code. It looks like the driver program goes into a long stop-the-world GC pause. This should not happen on the machine running the driver program if all that you are doing is reading data from HDFS, performing a bunch of

RE: concat spark dataframes

2016-06-15 Thread Mohammed Guller
Hey, there are quite a lot of fields, but there are no common fields between the 2 dataframes. Can I not concatenate the 2 frames like we can do in pandas such that the

RE: Spark 2.0 release date

2016-06-15 Thread Mohammed Guller
Andy – instead of Naïve Bayes, you should have used the Multi-layer Perceptron classifier ☺ Mohammed

RE: concat spark dataframes

2016-06-15 Thread Mohammed Guller
Hi Misha, What is the schema for both the DataFrames? And what is the expected schema of the resulting DataFrame? Mohammed

RE: JDBC Dialect for saving DataFrame into Vertica Table

2016-05-26 Thread Mohammed Guller
Vertica also provides a Spark connector. It was not GA the last time I looked at it, but available on the Vertica community site. Have you tried using the Vertica Spark connector instead of the JDBC driver? Mohammed

RE: Accessing Cassandra data from Spark Shell

2016-05-18 Thread Mohammed Guller
It definitely should be possible for 1.5.2 (I have used it with spark-shell and the Cassandra connector with 1.4.x). The main

RE: Accessing Cassandra data from Spark Shell

2016-05-10 Thread Mohammed Guller
Yes, it is very simple to access Cassandra data using the Spark shell.

Step 1: Launch the spark-shell with the spark-cassandra-connector package:

  $SPARK_HOME/bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.10:1.5.0

Step 2: Create a DataFrame pointing to your Cassandra table
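
The snippet is cut off at Step 2; a minimal sketch of that step (keyspace and table names are hypothetical):

  val df = sqlContext.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("table" -> "your_table", "keyspace" -> "your_keyspace"))
    .load()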

RE: Reading table schema from Cassandra

2016-05-10 Thread Mohammed Guller
You can create a DataFrame directly from a Cassandra table using something like this:

  val dfCassTable = sqlContext.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("table" -> "your_column_family", "keyspace" -> "your_keyspace"))
    .load()

Then, you can get the schema:

  val dfCassTableSchema = dfCassTable.schema

RE: Spark standalone workers, executors and JVMs

2016-05-04 Thread Mohammed Guller
On May 2, 2016, Mohammed Guller wrote: The workers and executors run as separate JVM processes in the standalone mode. The use of multiple workers on a single machine depends on how you will be using the clusters. If you

RE: Multiple Spark Applications that use Cassandra, how to share resources/nodes

2016-05-03 Thread Mohammed Guller
You can run multiple Spark applications simultaneously. Just limit the # of cores and memory allocated to each application. For example, if each node has 8 cores and there are 10 nodes and you want to be able to run 4 applications simultaneously, limit the # of cores for each application to 20.
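
A minimal sketch of how such a per-application cap might be set (names and values are assumptions based on the example above):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("app1")                  // hypothetical application name
    .set("spark.cores.max", "20")        // cap the total cores this application may use
    .set("spark.executor.memory", "4g")  // per-executor memory cap (value assumed)
  val sc = new SparkContext(conf)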

RE: Spark standalone workers, executors and JVMs

2016-05-02 Thread Mohammed Guller
The workers and executors run as separate JVM processes in the standalone mode. The use of multiple workers on a single machine depends on how you will be using the clusters. If you run multiple Spark applications simultaneously, each application gets its own executors. So, for example, if

RE: Create tab separated file from a dataframe spark 1.4 with Java

2016-04-21 Thread Mohammed Guller
It should be straightforward to do this using the spark-csv package. Assuming “myDF” is your DataFrame, you can use the following code to save data in a TSV file:

  myDF.write
    .format("com.databricks.spark.csv")
    .option("delimiter", "\t")
    .save("data.tsv")

Mohammed

Request to add a new book to the Books section on Spark's website

2016-03-09 Thread Mohammed Guller
My book on Spark was recently published. I would like to request it to be added to the Books section on Spark's website. Here are the details about the book. Title: Big Data Analytics with Spark Author: Mohammed Guller Link: www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656

updating the Books section on the Spark documentation page

2016-03-08 Thread Mohammed Guller
Hi - The Spark documentation page (http://spark.apache.org/documentation.html) has links to books covering Spark. What is the process for adding a new book to that list? Thanks, Mohammed

RE: convert SQL multiple Join in Spark

2016-03-03 Thread Mohammed Guller
Why not use Spark SQL? Mohammed

RE: Stage contains task of large size

2016-03-03 Thread Mohammed Guller
Just to elaborate more on what Silvio wrote below, check whether you are referencing a class or object member variable in a function literal/closure passed to one of the RDD methods. Mohammed
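
A minimal sketch of the pattern being described (class and field names are hypothetical):

  import org.apache.spark.rdd.RDD

  class Processor {
    val bigLookup: Map[String, Int] = Map("a" -> 1)  // member variable

    def run(rdd: RDD[String]): RDD[Int] = {
      // rdd.map(s => bigLookup.getOrElse(s, 0)) would capture `this`,
      // shipping the whole Processor instance with every task.
      val local = bigLookup                 // copy into a local val first
      rdd.map(s => local.getOrElse(s, 0))   // closure now captures only `local`
    }
  }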

RE: Update edge weight in graphx

2016-03-01 Thread Mohammed Guller
Like RDDs, Graphs are also immutable. Mohammed

RE: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-03-01 Thread Mohammed Guller
I agree that the Spark official documentation is pretty good. However, a book also serves a useful purpose. It provides a structured roadmap for learning a new technology. Everything is nicely organized for the reader. For somebody who has just started learning Spark, the amount of material on

RE: Spark UI standalone "crashes" after an application finishes

2016-02-29 Thread Mohammed Guller
I believe the OP is referring to the application UI on port 4040. The application UI on port 4040 is available only while the application is running. As per the documentation: To view the web UI after the fact, set spark.eventLog.enabled to true before starting the application. This configures
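
A minimal sketch of that configuration (the log directory is an assumed value):

  val conf = new org.apache.spark.SparkConf()
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.dir", "hdfs:///spark-events")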

RE: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Mohammed Guller
Hi Ashok, Another book recommendation (I am the author): “Big Data Analytics with Spark” The first half of the book is specifically written for people just getting started with Big Data and Spark. Mohammed

RE: Get all vertexes with outDegree equals to 0 with GraphX

2016-02-27 Thread Mohammed Guller
since we are using the default value for epred. HTH. Mohammed

RE: Standalone vs. Mesos for production installation on a smallish cluster

2016-02-26 Thread Mohammed Guller
I think you may be referring to Spark Survey 2015. According to that survey, 48% use standalone, 40% use YARN and only 11% use Mesos (the numbers don’t add up to 100 – probably because of rounding error). Mohammed Author: Big Data Analytics with

RE: Get all vertexes with outDegree equals to 0 with GraphX

2016-02-26 Thread Mohammed Guller
Here is another solution (minGraph is the graph from your code. I assume that is your original graph): val graphWithNoOutEdges = minGraph.filter( graph => graph.outerJoinVertices(graph.outDegrees) {(vId, vData, outDegreesOpt) => outDegreesOpt.getOrElse(0)}, vpred = (vId: VertexId,
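
The snippet is truncated; a minimal sketch of the complete pattern, assuming the vertex predicate keeps vertices whose out-degree is 0:

  import org.apache.spark.graphx._

  val graphWithNoOutEdges = minGraph.filter(
    graph => graph.outerJoinVertices(graph.outDegrees) {
      (vId, vData, outDegreesOpt) => outDegreesOpt.getOrElse(0)
    },
    vpred = (vId: VertexId, outDegree: Int) => outDegree == 0
  )
  val verticesWithNoOutEdges = graphWithNoOutEdges.vertices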

RE: Clarification on RDD

2016-02-26 Thread Mohammed Guller
HDFS, as the name implies, is a distributed file system. A file stored on HDFS is already distributed. So if you create an RDD from a HDFS file, the created RDD just points to the file partitions on different nodes. You can read more about HDFS here.

RE: Can we load csv partitioned data into one DF?

2016-02-22 Thread Mohammed Guller
Are all the csv files in the same directory? Mohammed

RE: Check if column exists in Schema

2016-02-15 Thread Mohammed Guller
The DataFrame class has a method named columns, which returns all column names as an array. You can then use the contains method in the Scala Array class to check whether a column exists. Mohammed
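
A minimal sketch ("age" is a hypothetical column name):

  val hasAge = df.columns.contains("age")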

RE: [MLLIB] Best way to extract RandomForest decision splits

2016-02-10 Thread Mohammed Guller
Why not use the save method from the RandomForestModel class to save a model at a specified path? Mohammed
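
A minimal sketch (the path is an assumed value):

  import org.apache.spark.mllib.tree.model.RandomForestModel

  model.save(sc, "hdfs:///models/myRandomForest")
  val sameModel = RandomForestModel.load(sc, "hdfs:///models/myRandomForest")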

RE: [MLLIB] Best way to extract RandomForest decision splits

2016-02-10 Thread Mohammed Guller
may be able to find other alternatives. Mohammed

RE: spark-cassandra-connector BulkOutputWriter

2016-02-09 Thread Mohammed Guller
Alex – I suggest posting this question on the Spark Cassandra Connector mailing list. The SCC developers are pretty responsive. Mohammed

RE: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread Mohammed Guller
You may have better luck with this question on the Spark Cassandra Connector mailing list. One quick question about this code from your email:

  // Load DataFrame from C* data-source
  val base_data = base_data_df.getInstance(sqlContext)

What exactly is base_data_df and how are

RE: How to collect/take arbitrary number of records in the driver?

2016-02-09 Thread Mohammed Guller
You can do something like this:

  val indexedRDD = rdd.zipWithIndex
  val filteredRDD = indexedRDD.filter { case (element, index) => (index >= 99) && (index < 199) }
  val result = filteredRDD.take(100)

Warning: the ordering of the elements in the RDD is not guaranteed. Mohammed

RE: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread Mohammed Guller
From: bernh...@chapter7.ch: Hi Mohammed, thanks for the hint, I should probably do that :) As for the DF

RE: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread Mohammed Guller
From: bernh...@chapter7.ch: Hi Mohammed, I'm aware of that documentation, what

RE: submit spark job with spcified file for driver

2016-02-04 Thread Mohammed Guller
Here is the description for the --files option that you can specify to spark-submit:

  --files FILES   Comma-separated list of files to be placed in the working directory of each executor.

Mohammed

RE: add new column in the schema + Dataframe

2016-02-04 Thread Mohammed Guller
Hi Divya, You can use the withColumn method from the DataFrame API. Here is the method signature:

  def withColumn(colName: String, col: Column): DataFrame

Mohammed
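
A minimal usage sketch (column names are hypothetical):

  val df2 = df.withColumn("total", df("price") * df("quantity"))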

RE: spark-cassandra

2016-02-03 Thread Mohammed Guller
Another thing to check is what version of the Spark-Cassandra-Connector the Spark Job Server is passing to the workers. It looks like when you use spark-submit, you are sending the correct SCC jar, but the Spark Job Server may be using a different one. Mohammed

RE: Spark 1.5.2 memory error

2016-02-03 Thread Mohammed Guller
Nirav, Sorry to hear about your experience with Spark; however, “sucks” is a very strong word. Many organizations are processing a lot more than 150GB of data with Spark. Mohammed

RE: Cassandra BEGIN BATCH

2016-02-03 Thread Mohammed Guller
Frank, I don’t think so. Cassandra does not support transactions in the traditional sense. It is not an ACID-compliant database. Mohammed

RE: how to introduce spark to your colleague if he has no background about *** spark related

2016-02-02 Thread Mohammed Guller
Hi Charles, You may find slides 16-20 from this deck useful: http://www.slideshare.net/mg007/big-data-trends-challenges-opportunities-57744483 I used it for a talk that I gave to MS students last week. I wanted to give them some context before describing Spark. It doesn’t cover all the stuff

RE: saveAsTextFile is not writing to local fs

2016-02-01 Thread Mohammed Guller
need to be aware of how big that data is and related implications. Mohammed

RE: saveAsTextFile is not writing to local fs

2016-02-01 Thread Mohammed Guller
From: Siva: Hi Mohammed, Thanks fo

RE: saveAsTextFile is not writing to local fs

2016-01-29 Thread Mohammed Guller
Is it a multi-node cluster or are you running Spark on a single machine? You can change Spark’s logging level to INFO or DEBUG to see what is going on. Mohammed

RE: JSON to SQL

2016-01-28 Thread Mohammed Guller
You don’t need Hive for that. The DataFrame class has a method named explode, which provides the same functionality. Here is an example from the Spark API documentation:

  df.explode("words", "word") { words: String => words.split(" ") }

The first argument to the explode method is the name of the

RE: a question about web ui log

2016-01-26 Thread Mohammed Guller
From: Philip Lee: Yes, I tried it, but it simply does not work. So, my c

RE: withColumn

2016-01-26 Thread Mohammed Guller
Naga – I believe that the second argument to the withColumn method has to be a column calculated from the source DataFrame on which you call that method. The following will work:

  df2.withColumn("age2", $"age" + 10)

Mohammed

RE: a question about web ui log

2016-01-25 Thread Mohammed Guller
I am not sure whether you can copy the log files from Spark workers to your local machine and view them from the Web UI. In fact, if you are able to copy the log files locally, you can just view them directly in any text editor. I suspect what you really want to see is the application history.

RE: Spark Cassandra clusters

2016-01-22 Thread Mohammed Guller
Vivek, By default, Cassandra uses ¼ of the system memory, so in your case, it will be around 8GB, which is fine. If you have more Cassandra-related questions, it is better to post them on the Cassandra mailing list. Also feel free to email me directly. Mohammed

RE: Date / time stuff with spark.

2016-01-22 Thread Mohammed Guller
Hi Andrew, Here is another option. You can define a custom schema to specify the correct type for the time column as shown below:

  import org.apache.spark.sql.types._
  val customSchema = StructType(
    StructField("a", IntegerType, false) ::
    StructField("b", LongType, false) ::
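
The snippet is truncated; a minimal sketch of the complete pattern, with an assumed TimestampType column and the spark-csv source (column names and path are hypothetical):

  import org.apache.spark.sql.types._

  val customSchema = StructType(
    StructField("a", IntegerType, false) ::
    StructField("b", LongType, false) ::
    StructField("time", TimestampType, false) :: Nil)

  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .schema(customSchema)
    .load("data.csv")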

RE: Is it possible to use SparkSQL JDBC ThriftServer without Hive

2016-01-15 Thread Mohammed Guller
tables. Mohammed

RE: Is it possible to use SparkSQL JDBC ThriftServer without Hive

2016-01-13 Thread Mohammed Guller
Hi Angela, Yes, you can use Spark SQL JDBC/ThriftServer without Hive. Mohammed

RE: Spark ignores SPARK_WORKER_MEMORY?

2016-01-13 Thread Mohammed Guller
Barak, The SPARK_WORKER_MEMORY setting is used for allocating memory to executors. You can use SPARK_DAEMON_MEMORY to set memory for the worker JVM. Mohammed

RE: spark job failure - akka error Association with remote system has failed

2016-01-13 Thread Mohammed Guller
Check the entries in your /etc/hosts file. Also check what the hostname command returns. Mohammed

RE: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Mohammed Guller
Did you mean Hive or Spark SQL JDBC/ODBC server? Mohammed From: Bryan Jeffrey: Mohammed, That is great. It looks like a perfect scenario. Would

RE: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Mohammed Guller
to manually SET it for each Beeline session. Mohammed From: Bryan Jeffrey: Answer: In beeline run the following: SET spark.cassandra.connection.host

RE: Cassandra via SparkSQL/Hive JDBC

2015-11-11 Thread Mohammed Guller
Short answer: yes. The Spark Cassandra Connector supports the data source API. So you can create a DataFrame that points directly to a Cassandra table. You can query it using the DataFrame API or the SQL/HiveQL interface. If you want to see an example, see slide# 27 and 28 in this deck that I
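
A minimal sketch of that approach (keyspace, table, and column names are hypothetical):

  val df = sqlContext.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("table" -> "users", "keyspace" -> "test"))
    .load()

  df.registerTempTable("users")
  sqlContext.sql("SELECT name FROM users WHERE age > 30").show()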

RE: Spark SQL Thriftserver and Hive UDF in Production

2015-10-18 Thread Mohammed Guller
Have you tried registering the function using the Beeline client? Another alternative would be to create a Spark SQL UDF and launch the Spark SQL Thrift server programmatically. Mohammed
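
A minimal sketch of the programmatic route (the UDF is hypothetical; sqlContext must be a HiveContext):

  import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

  sqlContext.udf.register("toUpper", (s: String) => s.toUpperCase)
  HiveThriftServer2.startWithContext(sqlContext)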

RE: dataframes and numPartitions

2015-10-15 Thread Mohammed Guller
You may find the spark.sql.shuffle.partitions property useful. The default value is 200. Mohammed From: Alex Nastetsky: A lot of RDD methods take a numPartitions
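
A minimal sketch (the value 400 is assumed):

  sqlContext.setConf("spark.sql.shuffle.partitions", "400")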

RE: laziness in textFile reading from HDFS?

2015-10-06 Thread Mohammed Guller
operation and then a save operation, I don't see how caching would help. Mohammed

RE: laziness in textFile reading from HDFS?

2015-10-06 Thread Mohammed Guller
-hadoop-throws-exception-for-large-lzo-files Mohammed From: Matt Narrell: Agreed. This is spark

RE: laziness in textFile reading from HDFS?

2015-10-05 Thread Mohammed Guller
Is there any specific reason for caching the RDD? How many passes do you make over the dataset? Mohammed

RE: Spark thrift service and Hive impersonation.

2015-09-29 Thread Mohammed Guller
From: Jagat Singh: Hi, Thanks for your reply. If you see the log message Error: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to

RE: laziness in textFile reading from HDFS?

2015-09-29 Thread Mohammed Guller
1) It is not required to have the same amount of memory as data.
2) By default, the # of partitions is equal to the number of HDFS blocks.
3) Yes, the read operation is lazy.
4) It is okay to have more partitions than cores.
Mohammed
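
A minimal sketch illustrating point 3 (the path is assumed):

  val lines = sc.textFile("hdfs:///data/input.txt")  // lazy: nothing is read yet
  val count = lines.count()                          // action: triggers the actual read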

RE: Spark thrift service and Hive impersonation.

2015-09-29 Thread Mohammed Guller
Does each user need to start their own thrift server to use it? No. One of the benefits of the Spark Thrift Server is that it allows multiple users to share a single SparkContext. Most likely, you have a file permissions issue. Mohammed

RE: Combining Spark Files with saveAsTextFile

2015-08-04 Thread Mohammed Guller
One option is to use the coalesce method in the RDD class. Mohammed

RE: Combining Spark Files with saveAsTextFile

2015-08-04 Thread Mohammed Guller
Just to further clarify, you can first call coalesce with argument 1 and then call saveAsTextFile. For example:

  rdd.coalesce(1).saveAsTextFile(...)

Mohammed

Spark SQL unable to recognize schema name

2015-08-04 Thread Mohammed Guller
Hi - I am running the Thrift JDBC/ODBC server (v1.4.1) and encountered a problem when querying tables using fully qualified table names (schemaName.tableName). The following query works fine from the beeline tool: SELECT * from test; However, the following query throws an exception, even

RE: Heatmap with Spark Streaming

2015-07-30 Thread Mohammed Guller
Umesh, You can create a web-service in any of the languages supported by Spark and stream the result from this web-service to your D3-based client using Websocket or Server-Sent Events. For example, you can create a webservice using Play. This app will integrate with Spark streaming in the

RE: Need help in SparkSQL

2015-07-22 Thread Mohammed Guller
Parquet Mohammed From: Jeetendra Gangele: Hi All, I have data in MongoDB (few TBs) which I want to migrate to HDFS to do complex query analysis on this data. Queries like AND queries

RE: Kmeans Labeled Point RDD

2015-07-20 Thread Mohammed Guller
I responded to your question on SO. Let me know if this is what you wanted. http://stackoverflow.com/a/31528274/2336943 Mohammed

RE: Data frames select and where clause dependency

2015-07-20 Thread Mohammed Guller
Michael, How would the Catalyst optimizer optimize this version? df.filter(df(filter_field) === value).select(field1).show() Would it still read all the columns in df or would it read only “filter_field” and “field1” since only two columns are used (assuming other columns from df are not used

RE: Data frames select and where clause dependency

2015-07-20 Thread Mohammed Guller
Thanks, Harish. Mike – this would be a cleaner version for your use case:

  df.filter(df(filter_field) === value).select(field1).show()

Mohammed

RE: Feature Generation On Spark

2015-07-18 Thread Mohammed Guller
From: rishikesh thakur: Thanks, I did look at the example. I am using Spark 1.2. The modules mentioned there are not in 1.2, I guess. The import is failing. Rishi

RE: Any beginner samples for using ML / MLIB to produce a moving average of a (K, iterable[V])

2015-07-15 Thread Mohammed Guller
I could be wrong, but it looks like the only implementation available right now is MultivariateOnlineSummarizer. Mohammed
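
A minimal sketch of that class (the sample values are made up):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

  val summarizer = new MultivariateOnlineSummarizer()
  summarizer.add(Vectors.dense(1.0, 10.0))
  summarizer.add(Vectors.dense(3.0, 20.0))
  println(summarizer.mean)      // running mean per dimension
  println(summarizer.variance)  // running variance per dimension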

RE: Spark performance

2015-07-13 Thread Mohammed Guller
Mohammed From: Michael Segel: Not necessarily. It depends on the use case and what you intend to do with the data. 4-6

RE: Spark performance

2015-07-11 Thread Mohammed Guller
You can certainly query over 4 TB of data with Spark. However, you will get an answer in minutes or hours, not in milliseconds or seconds. OLTP databases are used for web applications, and typically return

RE: Spark performance

2015-07-10 Thread Mohammed Guller
Hi Ravi, First, neither Spark nor Spark SQL is a database. Both are compute engines, which need to be paired with a storage system. Second, they are designed for processing large distributed datasets. If you have only 100,000 records or even a million records, you don’t need Spark. An RDBMS

RE: Feature Generation On Spark

2015-07-09 Thread Mohammed Guller
Take a look at the examples here: https://spark.apache.org/docs/latest/ml-guide.html Mohammed

RE: How to create a LabeledPoint RDD from a Data Frame

2015-07-06 Thread Mohammed Guller
Have you looked at the new Spark ML library? You can use a DataFrame directly with the Spark ML API. https://spark.apache.org/docs/latest/ml-guide.html Mohammed
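
For the direct conversion the subject line asks about, a minimal sketch (assumes column 0 is the label and columns 1–2 are numeric features):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  val labeledRDD = df.map { row =>
    LabeledPoint(row.getDouble(0), Vectors.dense(row.getDouble(1), row.getDouble(2)))
  }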

RE: How do we control output part files created by Spark job?

2015-07-06 Thread Mohammed Guller
You could repartition the dataframe before saving it. However, that would impact the parallelism of the next jobs that read these files from HDFS. Mohammed
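
A minimal sketch (the partition count and path are assumed values):

  df.repartition(8).write.parquet("hdfs:///output/path")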

RE: Spark SQL queries hive table, real time ?

2015-07-06 Thread Mohammed Guller
Hi Florian, It depends on a number of factors. How much data are you querying? Where is the data stored (HDD, SSD or DRAM)? What is the file format (Parquet or CSV)? In theory, it is possible to use Spark SQL for real-time queries, but cost increases as the data size grows. If you can store all

RE: Spark application with a RESTful API

2015-07-06 Thread Mohammed Guller
It is not a bad idea. Many people use this approach. Mohammed

RE: making dataframe for different types using spark-csv

2015-07-01 Thread Mohammed Guller
Another option is to provide the schema to the load method. One variant of sqlContext.load takes a schema as an input parameter. You can define the schema programmatically as shown here: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema

RE: Code review - Spark SQL command-line client for Cassandra

2015-06-22 Thread Mohammed Guller
I haven’t tried using Zeppelin with Spark on Cassandra, so can’t say for sure, but it should not be difficult. Mohammed

RE: Code review - Spark SQL command-line client for Cassandra

2015-06-20 Thread Mohammed Guller
Hi Mohammed, can you provide more info about the service you developed? On Jun 20, 2015 7:59 AM, Mohammed Guller

RE: Code review - Spark SQL command-line client for Cassandra

2015-06-19 Thread Mohammed Guller
Hi Matthew, It looks fine to me. I have built a similar service that allows a user to submit a query from a browser and returns the result in JSON format. Another alternative is to leave a Spark shell or one of the notebooks (Spark Notebook, Zeppelin, etc.) session open and run queries from

RE: Cassandra Submit

2015-06-09 Thread Mohammed Guller
-cassandra-2.1.5$ bin/cassandra-cli -h 127.0.0.1 -p 9160 Mohammed From: Yasemin Kaya: I removed core and streaming jar

RE: Cassandra Submit

2015-06-09 Thread Mohammed Guller
jar has the wrong version of the library that SCC is trying to use. Welcome to jar hell! Mohammed From: Yasemin Kaya: My code

RE: Cassandra Submit

2015-06-05 Thread Mohammed Guller
Check your spark.cassandra.connection.host setting. It should be pointing to one of your Cassandra nodes. Mohammed

RE: Anybody using Spark SQL JDBC server with DSE Cassandra?

2015-06-04 Thread Mohammed Guller
I am considering DSE, which has integrated Spark SQL Thrift/JDBC server with Cassandra. Mohammed

RE: Make HTTP requests from within Spark

2015-06-03 Thread Mohammed Guller
The short answer is yes. How you do it depends on a number of factors. Assuming you want to build an RDD from the responses and then analyze the responses using Spark core (not Spark Streaming), here is one simple way to do it: 1) Implement a class or function that connects to a web service and
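
A minimal sketch of that idea (URLs are hypothetical; assumes small text responses):

  import scala.io.Source

  val urls = sc.parallelize(Seq("http://example.com/a", "http://example.com/b"))
  val responses = urls.mapPartitions { iter =>
    iter.map { url =>
      val src = Source.fromURL(url)        // connect and fetch
      try src.mkString finally src.close() // read the response, then close
    }
  }
  responses.count()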

RE: Need some Cassandra integration help

2015-06-01 Thread Mohammed Guller
Hi Yana, Not sure whether you already solved this issue. As far as I know, the DataFrame support in Spark Cassandra connector was added in version 1.3. The first milestone release of SCC v1.3 was just announced. Mohammed

RE: Anybody using Spark SQL JDBC server with DSE Cassandra?

2015-06-01 Thread Mohammed Guller
Nobody using Spark SQL JDBC/Thrift server with DSE Cassandra? Mohammed From: Mohammed Guller: Hi - We have successfully integrated Spark

RE: Migrate Relational to Distributed

2015-06-01 Thread Mohammed Guller
Brant, You should be able to migrate most of your existing SQL code to Spark SQL, but remember that Spark SQL does not yet support the full ANSI standard. So you may need to rewrite some of your existing queries. Another thing to keep in mind is that Spark SQL is not real-time. The response
