Re: SparkSQL production readiness

2015-03-02 Thread Daniel Siegmann
OK, good to know data frames are still experimental. Thanks Michael. On Mon, Mar 2, 2015 at 12:37 PM, Michael Armbrust mich...@databricks.com wrote: We have been using Spark SQL in production for our customers at Databricks for almost a year now. We also know of some very large production

Re: JavaRDD method ambiguous after upgrading to Java 8

2015-03-02 Thread Sean Owen
What's your actual code? That can't compile, since groupBy would return a JavaPairRDD. I tried compiling that (after changing to void type) with Java 7 and Java 8 (meaning, not just the JDK but compiling for the language level too) and both worked. On Mon, Mar 2, 2015 at 10:03 PM, btiernay

Re: Spark Error: Cause was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@localhost:7077

2015-03-02 Thread Ted Yu
bq. Cause was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@localhost:7077 There should be some more output following the above line. Can you post them ? Cheers On Mon, Mar 2, 2015 at 2:06 PM, Krishnanand Khambadkone kkhambadk...@yahoo.com.invalid wrote: Hi, I am

Re: Dataframe v/s SparkSQL

2015-03-02 Thread Michael Armbrust
They are the same. These are just different ways to construct catalyst logical plans. On Mon, Mar 2, 2015 at 12:50 PM, Manoj Samel manojsamelt...@gmail.com wrote: Is it correct to say that Spark Dataframe APIs are implemented using same execution as SparkSQL ? In other words, while the
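Both routes below build the same Catalyst logical plan; a minimal Scala sketch (assuming Spark 1.3 and a hypothetical registered table named people):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
    import sqlContext.implicits._

    // SQL string
    val viaSql = sqlContext.sql("SELECT name, age FROM people WHERE age > 21")

    // DataFrame DSL: same optimizer, same execution engine
    val viaDsl = sqlContext.table("people")
      .filter($"age" > 21)
      .select("name", "age")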

Re: JavaRDD method ambiguous after upgrading to Java 8

2015-03-02 Thread btiernay
Seems like upgrading to 1.2.0 fixed the error. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/JavaRDD-method-ambiguous-after-upgrading-to-Java-8-tp21882p21883.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

JavaRDD method ambiguous after upgrading to Java 8

2015-03-02 Thread btiernay
The following method demonstrates the issue: private static Tuple2&lt;String, String&gt; group(JavaPairRDD&lt;String, String&gt; rdd, Function&lt;Tuple2&lt;String, String&gt;, String&gt; f) { return rdd.groupBy(f); } I get the following compilation error using Spark 1.1.1 and Java 8u31: The method

Spark Error: Cause was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@localhost:7077

2015-03-02 Thread Krishnanand Khambadkone
Hi, I am running Spark on my Mac. It reads from a Kafka topic and then writes the data to an HBase table. When I do a spark-submit, I get this error, Error connecting to master spark://localhost:7077 (akka.tcp://sparkMaster@localhost:7077), exiting. Cause was:

RDD partitions per executor in Cassandra Spark Connector

2015-03-02 Thread Rumph, Frens Jan
Hi all, I didn't find the *issues* button on https://github.com/datastax/spark-cassandra-connector/ so posting here. Anyone have an idea why token ranges are grouped into one partition per executor? I expected at least one per core. Any suggestions on how to work around this? Doing a

Re: Spark Error: Cause was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@localhost:7077

2015-03-02 Thread Krishnanand Khambadkone
This is the line, Error connecting to master spark://localhost:7077 (akka.tcp://sparkMaster@localhost:7077), exiting. On Monday, March 2, 2015 2:42 PM, Ted Yu yuzhih...@gmail.com wrote: bq. Cause was: akka.remote.InvalidAssociation: Invalid address:

Re: Spark Error: Cause was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@localhost:7077

2015-03-02 Thread Krishnanand Khambadkone
I ran it with the --verbose option and I see this output: Using properties file: null. Parsed arguments: master spark://localhost:7077, deployMode cluster, executorMemory 1g, executorCores null, totalExecutorCores null, propertiesFile

Re: Executing hive query from Spark code

2015-03-02 Thread Ted Yu
Here is snippet of dependency tree for spark-hive module: [INFO] org.apache.spark:spark-hive_2.10:jar:1.3.0-SNAPSHOT ... [INFO] +- org.spark-project.hive:hive-metastore:jar:0.13.1a:compile [INFO] | +- org.spark-project.hive:hive-shims:jar:0.13.1a:compile [INFO] | | +-

Spark UI and running spark-submit with --master yarn

2015-03-02 Thread Anupama Joshi
Hi, 1. When I run my application with --master yarn-cluster or --master yarn --deploy-mode cluster, I cannot see the Spark UI at the location masternode:4040. Even while the job is running, I cannot see the Spark UI. 2. When I run with --master yarn --deploy-mode client, I see

Re: Spark UI and running spark-submit with --master yarn

2015-03-02 Thread Marcelo Vanzin
That's the RM's RPC port, not the web UI port. (See Ted's e-mail - normally web UI is on 8088.) On Mon, Mar 2, 2015 at 4:14 PM, Anupama Joshi anupama.jo...@gmail.com wrote: Hi Marcelo, Thanks for the quick reply. I have a EMR cluster and I am running the spark-submit on the master node in the

Re: Spark UI and running spark-submit with --master yarn

2015-03-02 Thread Ted Yu
Default RM Web UI port is 8088 (configurable through yarn.resourcemanager.webapp.address) Cheers On Mon, Mar 2, 2015 at 4:14 PM, Anupama Joshi anupama.jo...@gmail.com wrote: Hi Marcelo, Thanks for the quick reply. I have a EMR cluster and I am running the spark-submit on the master node in

Re: Spark UI and running spark-submit with --master yarn

2015-03-02 Thread Marcelo Vanzin
What are you calling masternode? In yarn-cluster mode, the driver is running somewhere in your cluster, not on the machine where you run spark-submit. The easiest way to get to the Spark UI when using Yarn is to use the Yarn RM's web UI. That will give you a link to the application's UI

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-02 Thread Pat Ferrel
Sab, not sure what you require for the similarity metric or your use case but you can also look at spark-rowsimilarity or spark-itemsimilarity (column-wise) here http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

RE: Performance tuning in Spark SQL.

2015-03-02 Thread Abhishek Dubey
Hi, Thank you for your reply. It is surely going to help. Regards, Abhishek Dubey From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Monday, March 02, 2015 6:52 PM To: Abhishek Dubey; user@spark.apache.org Subject: RE: Performance tuning in Spark SQL. This is actually a quite open question,

Re: Spark UI and running spark-submit with --master yarn

2015-03-02 Thread Anupama Joshi
Hi Marcelo, Thanks for the quick reply. I have an EMR cluster and I am running spark-submit on the master node in the cluster. When I start spark-submit, I see 15/03/02 23:48:33 INFO client.RMProxy: Connecting to ResourceManager at /172.31.43.254:9022. But if I try that URL or use the

Re: Spark UI and running spark-submit with --master yarn

2015-03-02 Thread Marcelo Vanzin
That does not look like the RM UI. Please check your configuration for the port (see Ted's e-mail). On Mon, Mar 2, 2015 at 4:45 PM, Anupama Joshi anupama.jo...@gmail.com wrote: Hi, port 8088 does not show me anything (cannot connect), whereas port

Re: Spark Error: Cause was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@localhost:7077

2015-03-02 Thread Krish Khambadkone
There is no output after this line Sent from my iPhone On Mar 2, 2015, at 2:40 PM, Ted Yu yuzhih...@gmail.com wrote: bq. Cause was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@localhost:7077 There should be some more output following the above line. Can

Problems running version 1.3.0-rc1

2015-03-02 Thread Yiannis Gkoufas
Hi all, I have downloaded version 1.3.0-rc1 from https://github.com/apache/spark/archive/v1.3.0-rc1.zip, extracted it and built it using: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0 -DskipTests clean package It doesn't complain about any issues, but when I call sbin/start-all.sh I get in the logs:

RE: Is SQLContext thread-safe?

2015-03-02 Thread Haopu Wang
Thanks for the response. Then I have another question: when will we want to create multiple SQLContext instances from the same SparkContext? What's the benefit? -Original Message- From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Monday, March 02, 2015 9:05 PM To: Haopu Wang; user

RE: Is SQLContext thread-safe?

2015-03-02 Thread Cheng, Hao
Currently, each SQLContext has its own configuration, e.g. shuffle partition number, codegen etc., and it will be shared among the multiple threads running. We actually had some internal discussions on this; we will probably provide a thread-local configuration in the future for a single SQLContext
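A small illustrative sketch of the usage pattern under discussion: one SQLContext shared by several threads, with the caveat that settings such as the shuffle partition number are shared too (the table names are made up):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    sqlContext.sql("SET spark.sql.shuffle.partitions=8")   // seen by every thread below

    // Independent SELECTs run concurrently against the same context.
    val queries = Seq("SELECT count(*) FROM t1", "SELECT max(x) FROM t2")
    val threads = queries.map { q =>
      new Thread(new Runnable {
        def run(): Unit = println(q + " -> " + sqlContext.sql(q).collect().mkString(","))
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())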

Re: Spark Error: Cause was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@localhost:7077

2015-03-02 Thread Ted Yu
In AkkaUtils.scala: val akkaLogLifecycleEvents = conf.getBoolean("spark.akka.logLifecycleEvents", false) Can you turn on life cycle event logging to see if you get some more clues? Cheers On Mon, Mar 2, 2015 at 3:56 PM, Krishnanand Khambadkone kkhambadk...@yahoo.com wrote: I see
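A one-line sketch of turning that flag on in code (the app name is illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kafka-to-hbase")                    // illustrative
      .set("spark.akka.logLifecycleEvents", "true")    // log Akka life-cycle events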

RE: Is SQLContext thread-safe?

2015-03-02 Thread Haopu Wang
Hao, thank you so much for the reply! Do you already have some JIRA for the discussion? -Original Message- From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Tuesday, March 03, 2015 8:23 AM To: Haopu Wang; user Subject: RE: Is SQLContext thread-safe? Currently, each SQLContext has its

Re: throughput in the web console?

2015-03-02 Thread Saiph Kappa
I performed repartitioning and everything went fine with respect to the number of CPU cores being used (and respective times). However, I noticed something very strange: inside a map operation I was doing a very simple calculation and always using the same dataset (small enough to be entirely

Re: Spark Error: Cause was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@localhost:7077

2015-03-02 Thread Krishnanand Khambadkone
I see these messages now, spark.master - spark://krishs-mbp:7077 Classpath elements: Sending launch command to spark://krishs-mbp:7077 Driver successfully submitted as driver-20150302155433- ... waiting before polling master for driver state ... polling master for driver state State of

Re: Problem getting program to run on 15TB input

2015-03-02 Thread Arun Luthra
Everything works smoothly if I do the 99%-removal filter in Hive first. So, all the baggage from garbage collection was breaking it. Is there a way to filter() out 99% of the data without having to garbage collect 99% of the RDD? On Sun, Mar 1, 2015 at 9:56 AM, Arun Luthra arun.lut...@gmail.com

Re: SparkSQL Timestamp query failure

2015-03-02 Thread anu
Thank you Alessandro :) On Tue, Mar 3, 2015 at 10:03 AM, whitebread [via Apache Spark User List] ml-node+s1001560n2188...@n3.nabble.com wrote: Anu, 1) I defined my class Header as follows: case class Header(timestamp: java.sql.Timestamp, c_ip: String, cs_username: String, s_ip: String,
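For readers hitting the same problem, a hedged sketch of the approach described in the linked thread: declare the field as java.sql.Timestamp, register the RDD as a table, and compare against an explicit CAST (Spark 1.2 API; the file path and field layout are illustrative):

    import java.sql.Timestamp
    import org.apache.spark.sql.SQLContext

    case class Header(timestamp: Timestamp, c_ip: String, cs_username: String, s_ip: String)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit case-class RDD -> SchemaRDD conversion

    val headers = sc.textFile("access.log").map { line =>
      val f = line.split("\t")
      Header(Timestamp.valueOf(f(0)), f(1), f(2), f(3))
    }
    headers.registerTempTable("headers")

    val recent = sqlContext.sql(
      "SELECT * FROM headers WHERE timestamp >= CAST('2015-03-01 00:00:00' AS TIMESTAMP)")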

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-02 Thread Sabarish Sasidharan
Thanks Debasish, Reza and Pat. In my case, I am doing an SVD and then doing the similarities computation. So a rowSimiliarities() would be a good fit, looking forward to it. In the meanwhile I will try to see if I can further limit the number of similarities computed through some other fashion or

LBGFS optimizer performace

2015-03-02 Thread Gustavo Enrique Salazar Torres
Hi there: I'm using LBFGS optimizer to train a logistic regression model. The code I implemented follows the pattern showed in https://spark.apache.org/docs/1.2.0/mllib-linear-methods.html but training data is obtained from a Spark SQL RDD. The problem I'm having is that LBFGS tries to count the
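A trimmed sketch of the pattern from that page (Spark 1.2 MLlib), with the training data built from a Spark SQL query as the poster describes; the query and column layout are illustrative:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}

    // (label, features) pairs produced from a Spark SQL RDD
    val training = sqlContext.sql("SELECT label, f1, f2 FROM samples").map { row =>
      (row.getDouble(0), Vectors.dense(row.getDouble(1), row.getDouble(2)))
    }.cache()

    val initialWeights = Vectors.dense(new Array[Double](2))

    val (weights, lossHistory) = LBFGS.runLBFGS(
      training,
      new LogisticGradient(),
      new SquaredL2Updater(),
      10,     // numCorrections
      1e-4,   // convergenceTol
      100,    // maxNumIterations
      0.1,    // regParam
      initialWeights)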

Re: LBGFS optimizer performace

2015-03-02 Thread Akhil Das
Can you try increasing your driver memory, reducing the executors and increasing the executor memory? Thanks Best Regards On Tue, Mar 3, 2015 at 10:09 AM, Gustavo Enrique Salazar Torres gsala...@ime.usp.br wrote: Hi there: I'm using LBFGS optimizer to train a logistic regression model. The

Re: Exception while select into table.

2015-03-02 Thread Yi Tian
Hi, Some suggestions: 1. You should tell us the version of Spark and Hive you are using. 2. You should paste the full stack trace of the exception. In this case, I guess you have a nested directory in the path which bak_startup_log_uid_20150227 points to, and the config field

Re: unsafe memory access in spark 1.2.1

2015-03-02 Thread Akhil Das
Not sure, but it could be related to the Netty off-heap access as described here https://issues.apache.org/jira/browse/SPARK-4516, though the message was different. Thanks Best Regards On Mon, Mar 2, 2015 at 12:51 AM, Zalzberg, Idan (Agoda) idan.zalzb...@agoda.com wrote: Thanks, We

RE: Spark SQL Thrift Server start exception : java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory

2015-03-02 Thread Cheng, Hao
Copy those jars into the $SPARK_HOME/lib/ datanucleus-api-jdo-3.2.6.jar datanucleus-core-3.2.10.jar datanucleus-rdbms-3.2.9.jar see https://github.com/apache/spark/blob/master/bin/compute-classpath.sh#L120 -Original Message- From: fanooos [mailto:dev.fano...@gmail.com] Sent: Tuesday,

how to clean shuffle write each iteration

2015-03-02 Thread lisendong
I'm using Spark ALS. I set the iteration number to 30. And in each iteration, tasks will produce nearly 1TB of shuffle write. To my surprise, this shuffle data will not be cleaned until the total job finishes, which means I need 30TB of disk to store the shuffle data. I think after each

Spark SQL Thrift Server start exception : java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory

2015-03-02 Thread fanooos
I have installed a hadoop cluster (version: 2.6.0), apache spark (version: 1.2.1, pre-built for hadoop 2.4 and later), and hive (version 1.0.0). When I try to start the spark sql thrift server I am getting the following exception. Exception in thread "main" java.lang.RuntimeException:

Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-02 Thread shahab
Hi, According to the Spark SQL documentation, Spark SQL supports the vast majority of Hive features, such as User Defined Functions (UDFs), and one of these UDFs is the current_date() function, which should be supported. However, I get an error when I use this UDF in my SQL query. There are
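If the built-in turns out not to be available in the bundled Hive version, one hedged workaround when using a SQLContext/HiveContext programmatically is registering an equivalent function yourself (Spark 1.2 API; the function name, table and column are illustrative):

    import java.text.SimpleDateFormat
    import java.util.Date

    // Register a substitute UDF that formats an epoch-millis column as a date string.
    sqlContext.registerFunction("to_date_str", (millis: Long) =>
      new SimpleDateFormat("yyyy-MM-dd").format(new Date(millis)))

    sqlContext.sql("SELECT to_date_str(event_time) FROM logs LIMIT 1")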

Re: Architecture of Apache Spark SQL

2015-03-02 Thread Akhil Das
Here's the whole tech stack around it: [image: Inline image 1] For a bit more details you can refer this slide http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014?related=1 Previous project was Shark (SQL over spark), you can read about it from here

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-02 Thread Joseph Bradley
I see, thanks for clarifying! I'd recommend following existing implementations in spark.ml transformers. You'll need to define a UDF which operates on a single Row to compute the value for the new column. You can then use the DataFrame DSL to create the new column; the DSL provides a nice syntax
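A hedged sketch of the DSL pattern described here (Spark 1.3 DataFrame API; the DataFrame and column names are made up):

    import org.apache.spark.sql.functions.udf

    // UDF computing the new column's value from an existing column
    val lengthUdf = udf { (text: String) => if (text == null) 0 else text.length }

    // Derive the new column with the DataFrame DSL instead of a full Transformer
    val withFeature = df.withColumn("textLength", lengthUdf(df("text")))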

Re: Executing hive query from Spark code

2015-03-02 Thread Felix C
It should work in CDH without having to recompile. http://eradiating.wordpress.com/2015/02/22/getting-hivecontext-to-work-in-cdh/ --- Original Message --- From: Ted Yu yuzhih...@gmail.com Sent: March 2, 2015 1:35 PM To: nitinkak001 nitinkak...@gmail.com Cc: user user@spark.apache.org Subject:

RE: Executing hive query from Spark code

2015-03-02 Thread Cheng, Hao
I am not so sure how Spark SQL is compiled in CDH, but if the -Phive and -Phive-thriftserver flags weren't specified during the build, most likely it will not work just by providing the Hive lib jars later on. For example, does the HiveContext class exist in the assembly jar? I am also quite

Re: External Data Source in Spark

2015-03-02 Thread Akhil Das
Wouldn't it be possible with .saveAsNewAPIHadoopFile? How are you pushing the filters and projections currently? Thanks Best Regards On Tue, Mar 3, 2015 at 1:11 AM, Addanki, Santosh Kumar santosh.kumar.adda...@sap.com wrote: Hi Colleagues, Currently we have implemented External Data

Re: Architecture of Apache Spark SQL

2015-03-02 Thread Michael Armbrust
Here is a description of the optimizer: https://docs.google.com/a/databricks.com/document/d/1Hc_Ehtr0G8SQUg69cmViZsMi55_Kf3tISD9GPGU5M1Y/edit On Mon, Mar 2, 2015 at 10:18 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Here's the whole tech stack around it: [image: Inline image 1] For a

RE: Is SQLContext thread-safe?

2015-03-02 Thread Cheng, Hao
https://issues.apache.org/jira/browse/SPARK-2087 https://github.com/apache/spark/pull/4382 I am working on the prototype, but will be updated soon. -Original Message- From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Tuesday, March 3, 2015 8:32 AM To: Cheng, Hao; user Subject: RE: Is

Exception while select into table.

2015-03-02 Thread LinQili
Hi all, I was doing a select using Spark SQL like: insert into table startup_log_uid_20150227 select * from bak_startup_log_uid_20150227 where login_time 1425027600 Usually, it got an exception:

Re: Performance tuning in Spark SQL.

2015-03-02 Thread Stephen Boesch
You have sent four questions that are very general in nature. They might be better answered if you googled for those topics: there is a wealth of materials available. 2015-03-02 2:01 GMT-08:00 dubey_a abhishek.du...@xoriant.com: What are the ways to tune query performance in Spark SQL? --

Executing hive query from Spark code

2015-03-02 Thread nitinkak001
I want to run a Hive query inside Spark and use the RDDs generated from it inside Spark. I read in the documentation: Hive support is enabled by adding the -Phive and -Phive-thriftserver flags to Spark's build. This command builds a new assembly jar that includes Hive. Note that this Hive
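Assuming an assembly built with -Phive, a minimal sketch of running a Hive query and using the result as an RDD (the table name is illustrative):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // The result is a SchemaRDD (a DataFrame from 1.3 on), i.e. also an RDD of Rows.
    val rows = hiveContext.sql("SELECT key, value FROM src WHERE key < 100")
    val keys = rows.map(row => row.getInt(0))   // use it like any other RDD
    println(keys.count())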

Dataframe v/s SparkSQL

2015-03-02 Thread Manoj Samel
Is it correct to say that Spark Dataframe APIs are implemented using same execution as SparkSQL ? In other words, while the dataframe API is different than SparkSQL, the runtime performance of equivalent constructs in Dataframe and SparkSQL should be same. So one should be able to choose whichever

Re: Workaround for spark 1.2.X roaringbitmap kryo problem?

2015-03-02 Thread Arun Luthra
I think this is a Java vs scala syntax issue. Will check. On Thu, Feb 26, 2015 at 8:17 PM, Arun Luthra arun.lut...@gmail.com wrote: Problem is noted here: https://issues.apache.org/jira/browse/SPARK-5949 I tried this as a workaround: import org.apache.spark.scheduler._ import

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-02 Thread Reza Zadeh
Hi Sab, The current method is optimized for having many rows and few columns. In your case it is exactly the opposite. We are working on your case, tracked by this JIRA: https://issues.apache.org/jira/browse/SPARK-4823 Your case is very common, so I will put some time into building it. In the

Re: Store DStreams into Hive using Hive Streaming

2015-03-02 Thread tarek_abouzeid
If you have found a solution for this, could you please post it? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Store-DStreams-into-Hive-using-Hive-Streaming-tp18307p21877.html Sent from the Apache Spark User List mailing list archive at

Re: multiple sparkcontexts and streamingcontexts

2015-03-02 Thread Tamas Jambor
thanks for the reply. Actually, our main problem is not really about the SparkContext; the problem is that Spark does not allow creating streaming contexts dynamically, and once a stream is shut down, a new one cannot be created in the same SparkContext. So we cannot create a service that would

RE: Performance tuning in Spark SQL.

2015-03-02 Thread Cheng, Hao
This is actually a quite open question; from my understanding, there are probably ways to tune, like: SQL configurations such as spark.sql.autoBroadcastJoinThreshold (default 10 * 1024 * 1024) and spark.sql.defaultSizeInBytes (default 10 * 1024 * 1024 + 1)
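A short sketch of setting such knobs programmatically on the SQLContext (the values are only examples):

    // Per-SQLContext settings, via setConf or a SQL SET statement
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)
    sqlContext.setConf("spark.sql.shuffle.partitions", "400")
    sqlContext.sql("SET spark.sql.codegen=true")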

Re: multiple sparkcontexts and streamingcontexts

2015-03-02 Thread Sean Owen
I think everything there is to know about it is on JIRA; I don't think that's being worked on. On Mon, Mar 2, 2015 at 2:50 PM, Tamas Jambor jambo...@gmail.com wrote: I have seen there is a card (SPARK-2243) to enable that. Is that still going ahead? On Mon, Mar 2, 2015 at 2:46 PM, Sean Owen

Re: multiple sparkcontexts and streamingcontexts

2015-03-02 Thread Sean Owen
It is still not something you're supposed to do; in fact there is a setting (disabled by default) that throws an exception if you try to make multiple contexts. On Mon, Mar 2, 2015 at 2:43 PM, jamborta jambo...@gmail.com wrote: hi all, what is the current status and direction on enabling

Re: multiple sparkcontexts and streamingcontexts

2015-03-02 Thread Tamas Jambor
I have seen there is a card (SPARK-2243) to enable that. Is that still going ahead? On Mon, Mar 2, 2015 at 2:46 PM, Sean Owen so...@cloudera.com wrote: It is still not something you're supposed to do; in fact there is a setting (disabled by default) that throws an exception if you try to make

Performance tuning in Spark SQL.

2015-03-02 Thread dubey_a
What are the ways to tune query performance in Spark SQL? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Performance-tuning-in-Spark-SQL-tp21871.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

SQL Queries running on Schema RDD's in Spark SQL

2015-03-02 Thread dubey_a
How does the SQL queries really break down across nodes and run on Schema RDD's in background? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Queries-running-on-Schema-RDD-s-in-Spark-SQL-tp21870.html Sent from the Apache Spark User List mailing list

Re: Number of Executors per worker process

2015-03-02 Thread Spico Florin
Hello! Thank you very much for your response. In the book Learning Spark I found the following sentence: Each application will have at most one executor on each worker. So a worker can have one executor process spawned or none (perhaps the number depends on the workload distribution). Best

RE: Is SQLContext thread-safe?

2015-03-02 Thread Cheng, Hao
Yes it is thread safe, at least it's supposed to be. -Original Message- From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Monday, March 2, 2015 4:43 PM To: user Subject: Is SQLContext thread-safe? Hi, is it safe to use the same SQLContext to do Select operations in different threads

Is SQLContext thread-safe?

2015-03-02 Thread Haopu Wang
Hi, is it safe to use the same SQLContext to do Select operations in different threads at the same time? Thank you very much!

Combiners in Spark

2015-03-02 Thread Guillermo Ortiz
Which is the equivalent function to Combiners of MapReduce in Spark? I guess that it's combineByKey, but is combineByKey executed locally? I understand that functions such as reduceByKey or foldByKey aren't executed locally. Reading the documentation, it looks like combineByKey is equivalent to
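A small Scala sketch of the map-side combining both functions provide; the per-key average is just an example:

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 5)))

    // reduceByKey combines locally on each partition before the shuffle,
    // much like a MapReduce Combiner.
    val sums = pairs.reduceByKey(_ + _)

    // combineByKey is the general form; here it builds (sum, count) per key.
    val avgs = pairs.combineByKey(
        (v: Int) => (v, 1),                                          // createCombiner
        (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),       // mergeValue (map side)
        (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners (reduce side)
      ).mapValues { case (sum, count) => sum.toDouble / count }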

Re: documentation - graphx-programming-guide error?

2015-03-02 Thread Sean Owen
You are correct in that the type of messages being sent in that example is String and so reduceFun must operate on String. Being just an example, it can do any reasonable combining of messages. How about a + " " + b? Or the message could be changed to an Int. The mapReduceTriplets example above

Re: SparkSQL Timestamp query failure

2015-03-02 Thread anu
Can you please post how did you overcome this issue. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Timestamp-query-failure-tp19502p21868.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Best practices for query creation in Spark SQL.

2015-03-02 Thread dubey_a
Are there any best practices for schema design and query creation in Spark SQL? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Best-practices-for-query-creation-in-Spark-SQL-tp21872.html Sent from the Apache Spark User List mailing list archive at

Re: Connection pool in workers

2015-03-02 Thread A.K.M. Ashrafuzzaman
Thanks Chris, That is what I wanted to know :) A.K.M. Ashrafuzzaman Lead Software Engineer NewsCred (M) 880-175-5592433 Twitter | Blog | Facebook Check out The Academy, your #1 source for free content marketing resources On Mar 2, 2015, at 2:04 AM, Chris Fregly ch...@fregly.com wrote: hey

Architecture of Apache Spark SQL

2015-03-02 Thread dubey_a
What is the architecture of Apache Spark SQL? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Architecture-of-Apache-Spark-SQL-tp21869.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Best practices for query creation in Spark SQL.

2015-03-02 Thread Tobias Pfeiffer
Hi, I think your chances for a satisfying answer would increase dramatically if you elaborated a bit more on what you actually want to know. (Holds for any of your last four questions about Spark SQL...) Tobias

GraphX path traversal

2015-03-02 Thread Madabhattula Rajesh Kumar
Hi, I have the below edge list. How do I find the parent path for every vertex? Example: Vertex 1 path: 2, 3, 4, 5, 6; Vertex 2 path: 3, 4, 5, 6; Vertex 3 path: 4, 5, 6; Vertex 4 path: 5, 6; Vertex 5 path: 6. Could you please let me know how to do this? (or) Any suggestion Source Vertex

Re: Combiners in Spark

2015-03-02 Thread Sean Owen
I think the simplest answer is that it's not really a separate concept from the 'reduce' function, because Spark's API is a sort of simpler, purer form of FP. It is just the same function that can be applied at many points in an aggregation -- both map side (a la Combiners in MapReduce) or reduce

Re: multiple sparkcontexts and streamingcontexts

2015-03-02 Thread Nan Zhu
there are some “hidden” APIs potentially addressing your problem (but with a bit of complexity): by using the Actor Receiver, you can tell the supervisor of the actor receiver to create another actor receiver for you; the ActorRef of the newly created Actor will be sent to the caller of the API (in

Re: Scalable JDBCRDD

2015-03-02 Thread Cody Koeninger
Have you already tried using the Vertica hadoop input format with spark? I don't know how it's implemented, but I'd hope that it has some notion of vertica-specific shard locality (which JdbcRDD does not). If you're really constrained to consuming the result set in a single thread, whatever

Re: multiple sparkcontexts and streamingcontexts

2015-03-02 Thread Sean Owen
You can make a new StreamingContext on an existing SparkContext, I believe? On Mon, Mar 2, 2015 at 3:01 PM, Tamas Jambor jambo...@gmail.com wrote: thanks for the reply. Actually, our main problem is not really about sparkcontext, the problem is that spark does not allow to create streaming

Re: multiple sparkcontexts and streamingcontexts

2015-03-02 Thread Tamas Jambor
Sorry, I meant once the stream is started, it's not possible to create new streams in the existing streaming context, and it's not possible to create a new streaming context if another one is already running. So the only feasible option seemed to be creating a new SparkContext for each stream (tried using

Re: Scalable JDBCRDD

2015-03-02 Thread Michal Klos
Hi Cody, Thanks for the reply. Yea, we thought of possibly doing this in a UDX in Vertica somehow to get the lower level co-operation but its a bit daunting. We want to do this because there are things we want to do with the result-set in Spark that are not possible in Vertica. The DStream

Re: java.util.NoSuchElementException: key not found:

2015-03-02 Thread Rok Roskar
aha ok, thanks. If I create different RDDs from a parent RDD and force evaluation thread-by-thread, then it should presumably be fine, correct? Or do I need to checkpoint the child RDDs as a precaution in case it needs to be removed from memory and recomputed? On Sat, Feb 28, 2015 at 4:28 AM,

Re: bitten by spark.yarn.executor.memoryOverhead

2015-03-02 Thread Ted Yu
bq. that 0.1 is always enough? The answer is: it depends (on use cases). The value of 0.1 has been validated by several users. I think it is a reasonable default. Cheers On Mon, Mar 2, 2015 at 8:36 AM, Ryan Williams ryan.blake.willi...@gmail.com wrote: For reference, the initial version of

Re: Upgrade to Spark 1.2.1 using Guava

2015-03-02 Thread Pat Ferrel
Marcelo’s work-around works. So if you are using the itemsimilarity stuff, the CLI has a way to solve the class not found and I can point out how to do the equivalent if you are using the library API. Ping me if you care. On Feb 28, 2015, at 2:27 PM, Erlend Hamnaberg erl...@hamnaberg.net

Re: bitten by spark.yarn.executor.memoryOverhead

2015-03-02 Thread Sean Owen
The problem is, you're left with two competing options then. You can go through the process of deprecating the absolute one and removing it eventually. You take away ability to set this value directly though, meaning you'd have to set absolute values by depending on a % of what you set your app

Re: Is SparkSQL optimizer aware of the needed data after the query?

2015-03-02 Thread Michael Armbrust
-dev +user No, lambda functions and other code are black-boxes to Spark SQL. If you want those kinds of optimizations you need to express the columns required in either SQL or the DataFrame DSL (coming in 1.3). On Mon, Mar 2, 2015 at 1:55 AM, Wail w.alkowail...@cces-kacst-mit.org wrote:
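A sketch of the distinction (Spark 1.3 DSL; the DataFrame and its columns are illustrative):

    // Opaque to the optimizer: the lambda is a black box, so all columns are materialized.
    val opaque = df.rdd.map(row => row.getString(0))

    // Visible to the optimizer: only the named column is needed, which a
    // columnar source such as Parquet can exploit.
    val pruned = df.select("name").rdd.map(row => row.getString(0))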

Re: SparkSQL production readiness

2015-03-02 Thread Michael Armbrust
We have been using Spark SQL in production for our customers at Databricks for almost a year now. We also know of some very large production deployments elsewhere. It is still a young project, but I wouldn't call it alpha. The primary changes to the API are the addition of the DataFrame

Re: Is SPARK_CLASSPATH really deprecated?

2015-03-02 Thread Marcelo Vanzin
Just a note for whoever writes the doc, spark.executor.extraClassPath is *prepended* to the executor's classpath, which is a rather important distinction. :-) On Fri, Feb 27, 2015 at 12:21 AM, Patrick Wendell pwend...@gmail.com wrote: I think we need to just update the docs, it is a bit unclear

Issues reading in Json file with spark sql

2015-03-02 Thread kpeng1
Hi All, I am currently having issues reading in a json file using spark sql's api. Here is what the json file looks like: { "namespace": "spacey", "name": "namer", "type": "record", "fields": [ {"name":"f1","type":["null","string"]}, {"name":"f2","type":["null","string"]}, {"name":"f3","type":["null","string"]},

Re: Issues reading in Json file with spark sql

2015-03-02 Thread Yin Huai
Is the string of the above JSON object on a single line? jsonFile requires that every line is a JSON object or an array of JSON objects. On Mon, Mar 2, 2015 at 11:28 AM, kpeng1 kpe...@gmail.com wrote: Hi All, I am currently having issues reading in a json file using spark sql's api. Here is

Re: What joda-time dependency does spark submit use/need?

2015-03-02 Thread Su She
Hi Todd, So I am already specifying joda-time-2.7 (have tried 2.2, 2.3, 2.6, 2.7) in the --jars option. I tried using the joda-time bundle jar ( http://mvnrepository.com/artifact/org.apache.servicemix.bundles/org.apache.servicemix.bundles.joda-time/2.3_1) which comes with joda-convert. I know

External Data Source in Spark

2015-03-02 Thread Addanki, Santosh Kumar
Hi Colleagues, Currently we have implemented the External Data Source API and are able to push filters and projections. Could you provide some info on how the joins could perhaps be pushed to the original data source if both data sources are from the same database? Briefly looked at
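For reference, a skeleton of the filter/projection pushdown the poster already has (Spark 1.3 sources API); joins have no equivalent hook in this interface, and the class here is purely illustrative:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
    import org.apache.spark.sql.types.StructType

    class MyRelation(override val sqlContext: SQLContext, override val schema: StructType)
      extends BaseRelation with PrunedFilteredScan {

      override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
        // Translate requiredColumns and filters into a query against the external
        // database here; the empty RDD just keeps this skeleton compilable.
        sqlContext.sparkContext.emptyRDD[Row]
      }
    }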

Re: Issues reading in Json file with spark sql

2015-03-02 Thread Emre Sevinc
According to Spark SQL Programming Guide: jsonFile - loads data from a directory of JSON files where each line of the files is a JSON object. Note that the file that is offered as jsonFile is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a
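If the input is pretty-printed (multi-line) JSON instead, a hedged workaround is to read whole files and hand the strings to jsonRDD (the path is illustrative):

    // Each element of wholeTextFiles is (path, fullFileContents), so a JSON document
    // spanning many lines still arrives as a single string.
    val whole = sc.wholeTextFiles("hdfs:///data/json-dir").map(_._2)

    val records = sqlContext.jsonRDD(whole)   // infers the schema from the JSON strings
    records.registerTempTable("records")
    sqlContext.sql("SELECT * FROM records LIMIT 10").collect().foreach(println)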