OK, good to know data frames are still experimental. Thanks Michael.
On Mon, Mar 2, 2015 at 12:37 PM, Michael Armbrust mich...@databricks.com
wrote:
We have been using Spark SQL in production for our customers at Databricks
for almost a year now. We also know of some very large production
What's your actual code? That can't compile, since groupBy would return
a JavaPairRDD.
I tried compiling that (after changing to void type) with Java 7 and
Java 8 (meaning, not just the JDK but compiling for the language level
too) and both worked.
On Mon, Mar 2, 2015 at 10:03 PM, btiernay
bq. Cause was: akka.remote.InvalidAssociation: Invalid address:
akka.tcp://sparkMaster@localhost:7077
There should be some more output following the above line.
Can you post them?
Cheers
On Mon, Mar 2, 2015 at 2:06 PM, Krishnanand Khambadkone
kkhambadk...@yahoo.com.invalid wrote:
Hi, I am
They are the same. These are just different ways to construct catalyst
logical plans.
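For example, a small sketch (assuming the 1.3 DataFrame API, an existing
sqlContext, and a hypothetical registered table "people"): both forms below
build the same Catalyst logical plan, so they are optimized and executed
identically.

  import org.apache.spark.sql.functions.col

  val viaSql = sqlContext.sql("SELECT name FROM people WHERE age > 21")
  val viaDf  = sqlContext.table("people").where(col("age") > 21).select("name")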
On Mon, Mar 2, 2015 at 12:50 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Is it correct to say that Spark Dataframe APIs are implemented using same
execution as SparkSQL ? In other words, while the
Seems like upgrading to 1.2.0 fixed the error.
The following method demonstrates the issue:
private static Tuple2<String, String> group(JavaPairRDD<String, String> rdd,
    Function<Tuple2<String, String>, String> f) {
  return rdd.groupBy(f);
}
I get the following compilation error using Spark 1.1.1 and Java 8u31:
The method
Hi, I am running Spark on my Mac. It reads from a Kafka topic and then
writes the data to an HBase table. When I do a spark-submit, I get this error:
Error connecting to master spark://localhost:7077
(akka.tcp://sparkMaster@localhost:7077), exiting.
Cause was:
Hi all,
I didn't find the *issues* button on
https://github.com/datastax/spark-cassandra-connector/ so posting here.
Anyone have an idea why token ranges are grouped into one partition per
executor? I expected at least one per core. Any suggestions on how to work
around this? Doing a
This is the line,
Error connecting to master spark://localhost:7077
(akka.tcp://sparkMaster@localhost:7077), exiting.
On Monday, March 2, 2015 2:42 PM, Ted Yu yuzhih...@gmail.com wrote:
bq. Cause was: akka.remote.InvalidAssociation: Invalid address:
I ran it with the --verbose option and I see this output
Using properties file: null
Parsed arguments:
master spark://localhost:7077
deployMode cluster
executorMemory 1g
executorCores null
totalExecutorCores null
propertiesFile
Here is a snippet of the dependency tree for the spark-hive module:
[INFO] org.apache.spark:spark-hive_2.10:jar:1.3.0-SNAPSHOT
...
[INFO] +- org.spark-project.hive:hive-metastore:jar:0.13.1a:compile
[INFO] | +- org.spark-project.hive:hive-shims:jar:0.13.1a:compile
[INFO] | | +-
Hi,
1. When I run my application with --master yarn-cluster or --master
yarn --deploy-mode cluster, I cannot see the Spark UI at the location
masternode:4040. Even when I am running the job, I cannot see the Spark UI.
2. When I run with --master yarn --deploy-mode client -- I see
That's the RM's RPC port, not the web UI port. (See Ted's e-mail -
normally web UI is on 8088.)
On Mon, Mar 2, 2015 at 4:14 PM, Anupama Joshi anupama.jo...@gmail.com wrote:
Hi Marcelo,
Thanks for the quick reply.
I have an EMR cluster and I am running the spark-submit on the master node in
the
Default RM Web UI port is 8088 (configurable
through yarn.resourcemanager.webapp.address)
Cheers
On Mon, Mar 2, 2015 at 4:14 PM, Anupama Joshi anupama.jo...@gmail.com
wrote:
Hi Marcelo,
Thanks for the quick reply.
I have an EMR cluster and I am running the spark-submit on the master node
in
What are you calling masternode? In yarn-cluster mode, the driver
is running somewhere in your cluster, not on the machine where you run
spark-submit.
The easiest way to get to the Spark UI when using Yarn is to use the
Yarn RM's web UI. That will give you a link to the application's UI
Sab, not sure what you require for the similarity metric or your use case but
you can also look at spark-rowsimilarity or spark-itemsimilarity (column-wise)
here http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
Hi,
Thank you for your reply. It will surely help.
Regards,
Abhishek Dubey
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Monday, March 02, 2015 6:52 PM
To: Abhishek Dubey; user@spark.apache.org
Subject: RE: Performance tuning in Spark SQL.
This is actually a quite open question,
Hi Marcelo,
Thanks for the quick reply.
I have an EMR cluster and I am running the spark-submit on the master node
in the cluster.
When I start the spark-submit , I see
15/03/02 23:48:33 INFO client.RMProxy: Connecting to ResourceManager at /
172.31.43.254:9022
But if I try that URL or use the
That does not look like the RM UI. Please check your configuration for
the port (see Ted's e-mail).
On Mon, Mar 2, 2015 at 4:45 PM, Anupama Joshi anupama.jo...@gmail.com wrote:
Hi,
port 8088 does not show me anything (cannot connect),
whereas port
There is no output after this line
Sent from my iPhone
On Mar 2, 2015, at 2:40 PM, Ted Yu yuzhih...@gmail.com wrote:
bq. Cause was: akka.remote.InvalidAssociation: Invalid address:
akka.tcp://sparkMaster@localhost:7077
There should be some more output following the above line.
Can
Hi all,
I have downloaded version 1.3.0-rc1 from
https://github.com/apache/spark/archive/v1.3.0-rc1.zip, extracted it and
built it using:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0 -DskipTests clean package
It doesn't complain about any issues, but when I call sbin/start-all.sh I get
on logs:
Thanks for the response.
Then I have another question: when will we want to create multiple
SQLContext instances from the same SparkContext? What's the benefit?
-Original Message-
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Monday, March 02, 2015 9:05 PM
To: Haopu Wang; user
Currently, each SQLContext has its own configuration, e.g. shuffle partition
number, codegen, etc., and it is shared among the multiple threads running.
We actually had some internal discussions on this; we will probably provide a
thread-local configuration in the future for a single SQLContext
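A small sketch of how that per-SQLContext configuration is set today (the
property name is a real Spark SQL setting; the value is only an example):

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  // scoped to this SQLContext and shared by every thread that uses it
  sqlContext.setConf("spark.sql.shuffle.partitions", "200")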
In AkkaUtils.scala:
val akkaLogLifecycleEvents =
  conf.getBoolean("spark.akka.logLifecycleEvents", false)
Can you turn on life cycle event logging to see if you would get some more
clues?
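One hedged way to enable it (assuming the driver builds its own SparkConf):

  val conf = new org.apache.spark.SparkConf()
    .set("spark.akka.logLifecycleEvents", "true")
  val sc = new org.apache.spark.SparkContext(conf)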
Cheers
On Mon, Mar 2, 2015 at 3:56 PM, Krishnanand Khambadkone
kkhambadk...@yahoo.com wrote:
I see
Hao, thank you so much for the reply!
Do you already have some JIRA for the discussion?
-Original Message-
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Tuesday, March 03, 2015 8:23 AM
To: Haopu Wang; user
Subject: RE: Is SQLContext thread-safe?
Currently, each SQLContext has its
I performed repartitioning and everything went fine with respect to the
number of CPU cores being used (and respective times). However, I noticed
something very strange: inside a map operation I was doing a very simple
calculation and always using the same dataset (small enough to be entirely
I see these messages now,
spark.master - spark://krishs-mbp:7077
Classpath elements:
Sending launch command to spark://krishs-mbp:7077
Driver successfully submitted as driver-20150302155433-
... waiting before polling master for driver state
... polling master for driver state
State of
Everything works smoothly if I do the 99%-removal filter in Hive first. So,
all the baggage from garbage collection was breaking it.
Is there a way to filter() out 99% of the data without having to garbage
collect 99% of the RDD?
On Sun, Mar 1, 2015 at 9:56 AM, Arun Luthra arun.lut...@gmail.com
Thank you Alessandro :)
On Tue, Mar 3, 2015 at 10:03 AM, whitebread [via Apache Spark User List]
ml-node+s1001560n2188...@n3.nabble.com wrote:
Anu,
1) I defined my class Header as it follows:
case class Header(timestamp: java.sql.Timestamp, c_ip: String,
cs_username: String, s_ip: String,
Thanks Debasish, Reza and Pat. In my case, I am doing an SVD and then doing
the similarities computation. So a rowSimilarities() would be a good fit,
looking forward to it.
In the meantime I will try to see if I can further limit the number of
similarities computed through some other fashion or
Hi there:
I'm using LBFGS optimizer to train a logistic regression model. The code I
implemented follows the pattern shown in
https://spark.apache.org/docs/1.2.0/mllib-linear-methods.html but training
data is obtained from a Spark SQL RDD.
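Roughly the guide's pattern, with the training set mapped out of a SQL query
result (the query, column positions, and feature count here are assumptions
for illustration only):

  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  val rows = sqlContext.sql("SELECT label, f1, f2 FROM training_table")
  val training = rows.map(r =>
    LabeledPoint(r.getDouble(0), Vectors.dense(r.getDouble(1), r.getDouble(2)))).cache()
  val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)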
The problem I'm having is that LBFGS tries to count the
Can you try increasing your driver memory, reducing the executors and
increasing the executor memory?
Thanks
Best Regards
On Tue, Mar 3, 2015 at 10:09 AM, Gustavo Enrique Salazar Torres
gsala...@ime.usp.br wrote:
Hi there:
I'm using LBFGS optimizer to train a logistic regression model. The
Hi,
Some suggestions:
1. You should tell us the versions of Spark and Hive you are using.
2. You should paste the full stack trace of the exception.
In this case, I guess you have a nested directory in the path which
|bak_startup_log_uid_20150227| points to.
and the config field
Not sure, but it could be related to the Netty off-heap access described
here https://issues.apache.org/jira/browse/SPARK-4516, though the message was
different.
Thanks
Best Regards
On Mon, Mar 2, 2015 at 12:51 AM, Zalzberg, Idan (Agoda)
idan.zalzb...@agoda.com wrote:
Thanks,
We
Copy those jars into the $SPARK_HOME/lib/
datanucleus-api-jdo-3.2.6.jar
datanucleus-core-3.2.10.jar
datanucleus-rdbms-3.2.9.jar
see https://github.com/apache/spark/blob/master/bin/compute-classpath.sh#L120
-Original Message-
From: fanooos [mailto:dev.fano...@gmail.com]
Sent: Tuesday,
I'm using Spark ALS.
I set the iteration number to 30.
In each iteration, tasks produce nearly 1 TB of shuffle write.
To my surprise, this shuffle data is not cleaned until the whole job
finishes, which means I need 30 TB of disk to store the shuffle data.
I think after each
I have installed a Hadoop cluster (version: 2.6.0), Apache Spark (version:
1.2.1, prebuilt for Hadoop 2.4 and later), and Hive (version 1.0.0).
When I try to start the Spark SQL thrift server I get the following
exception.
Exception in thread "main" java.lang.RuntimeException:
Hi,
According to the Spark SQL documentation, Spark SQL supports the vast
majority of Hive features, such as User Defined Functions (UDFs), and one
of these UDFs is the current_date() function, which should be supported.
However, I get an error when I use this UDF in my SQL query. There are
Here's the whole tech stack around it:
[image: Inline image 1]
For a bit more details you can refer this slide
http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014?related=1
Previous project was Shark (SQL over spark), you can read about it from
here
I see, thanks for clarifying!
I'd recommend following existing implementations in spark.ml transformers.
You'll need to define a UDF which operates on a single Row to compute the
value for the new column. You can then use the DataFrame DSL to create the
new column; the DSL provides a nice syntax
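A minimal sketch of that approach (column names are hypothetical), using a UDF
over a single value plus the DataFrame DSL to derive the new column:

  import org.apache.spark.sql.functions.udf

  val toUpper = udf((s: String) => if (s == null) null else s.toUpperCase)
  val withNewCol = df.withColumn("name_upper", toUpper(df("name")))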
It should work in CDH without having to recompile.
http://eradiating.wordpress.com/2015/02/22/getting-hivecontext-to-work-in-cdh/
--- Original Message ---
From: Ted Yu yuzhih...@gmail.com
Sent: March 2, 2015 1:35 PM
To: nitinkak001 nitinkak...@gmail.com
Cc: user user@spark.apache.org
Subject:
I am not so sure how Spark SQL is compiled in CDH, but if the -Phive and
-Phive-thriftserver flags were not specified during the build, most likely it
will not work just by providing the Hive lib jars later on. For example, does
the HiveContext class exist in the assembly jar?
I am also quite
Wouldn't it be possible with .saveAsNewAPIHadoopFile? How are you pushing
the filters and projections currently?
Thanks
Best Regards
On Tue, Mar 3, 2015 at 1:11 AM, Addanki, Santosh Kumar
santosh.kumar.adda...@sap.com wrote:
Hi Colleagues,
Currently we have implemented External Data
Here is a description of the optimizer:
https://docs.google.com/a/databricks.com/document/d/1Hc_Ehtr0G8SQUg69cmViZsMi55_Kf3tISD9GPGU5M1Y/edit
On Mon, Mar 2, 2015 at 10:18 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Here's the whole tech stack around it:
[image: Inline image 1]
For a
https://issues.apache.org/jira/browse/SPARK-2087
https://github.com/apache/spark/pull/4382
I am working on the prototype; it will be updated soon.
-Original Message-
From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent: Tuesday, March 3, 2015 8:32 AM
To: Cheng, Hao; user
Subject: RE: Is
Hi all, I was doing a select using Spark SQL like:
insert into table startup_log_uid_20150227
select * from bak_startup_log_uid_20150227
where login_time 1425027600
Usually, it got an exception:
You have sent four questions that are very general in nature. They might be
better answered if you googled for those topics: there is a wealth of
materials available.
2015-03-02 2:01 GMT-08:00 dubey_a abhishek.du...@xoriant.com:
What are the ways to tune query performance in Spark SQL?
--
I want to run Hive query inside Spark and use the RDDs generated from that
inside Spark. I read in the documentation
/Hive support is enabled by adding the -Phive and -Phive-thriftserver flags
to Spark’s build. This command builds a new assembly jar that includes Hive.
Note that this Hive
Is it correct to say that Spark Dataframe APIs are implemented using same
execution as SparkSQL ? In other words, while the dataframe API is
different than SparkSQL, the runtime performance of equivalent constructs
in Dataframe and SparkSQL should be same. So one should be able to choose
whichever
I think this is a Java vs scala syntax issue. Will check.
On Thu, Feb 26, 2015 at 8:17 PM, Arun Luthra arun.lut...@gmail.com wrote:
Problem is noted here: https://issues.apache.org/jira/browse/SPARK-5949
I tried this as a workaround:
import org.apache.spark.scheduler._
import
Hi Sab,
The current method is optimized for having many rows and few columns. In
your case it is exactly the opposite. We are working on your case, tracked
by this JIRA: https://issues.apache.org/jira/browse/SPARK-4823
Your case is very common, so I will put some time into building it.
In the
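For reference, a hedged sketch of the current (column-wise) method on a
RowMatrix, with an optional DIMSUM threshold (the threshold value and the
input RDD name are assumptions):

  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  val mat = new RowMatrix(rowVectors)        // rowVectors: RDD[Vector]
  val colSims = mat.columnSimilarities(0.5)  // CoordinateMatrix of column similarities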
If you have found a solution for this, could you please post it?
Thanks for the reply.
Actually, our main problem is not really about the SparkContext; the problem is
that Spark does not allow creating a streaming context dynamically, and once
a stream is shut down, a new one cannot be created in the same
SparkContext. So we cannot create a service that would
This is actually quite an open question. From my understanding, there are
probably several ways to tune, such as:
* SQL configurations like:
  Configuration Key                        Default Value
  spark.sql.autoBroadcastJoinThreshold     10 * 1024 * 1024
  spark.sql.defaultSizeInBytes             10 * 1024 * 1024 + 1
I think everything there is to know about it is on JIRA; I don't think
that's being worked on.
On Mon, Mar 2, 2015 at 2:50 PM, Tamas Jambor jambo...@gmail.com wrote:
I have seen there is a card (SPARK-2243) to enable that. Is that still going
ahead?
On Mon, Mar 2, 2015 at 2:46 PM, Sean Owen
It is still not something you're supposed to do; in fact there is a
setting (disabled by default) that throws an exception if you try to
make multiple contexts.
On Mon, Mar 2, 2015 at 2:43 PM, jamborta jambo...@gmail.com wrote:
hi all,
what is the current status and direction on enabling
I have seen there is a card (SPARK-2243) to enable that. Is that still
going ahead?
On Mon, Mar 2, 2015 at 2:46 PM, Sean Owen so...@cloudera.com wrote:
It is still not something you're supposed to do; in fact there is a
setting (disabled by default) that throws an exception if you try to
make
What are the ways to tune query performance in Spark SQL?
How do the SQL queries really break down across nodes and run on
SchemaRDDs in the background?
Hello!
Thank you very much for your response. In the book Learning Spark I
found the following sentence:
"Each application will have at most one executor on each worker"
So a worker can have one or no executor process spawned (perhaps the number
depends on the workload distribution).
Best
Yes, it is thread-safe; at least, it's supposed to be.
-Original Message-
From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent: Monday, March 2, 2015 4:43 PM
To: user
Subject: Is SQLContext thread-safe?
Hi, is it safe to use the same SQLContext to do Select operations in different
threads
Hi, is it safe to use the same SQLContext to do Select operations in
different threads at the same time? Thank you very much!
Which is the equivalent function in Spark to the Combiners of MapReduce?
I guess it's combineByKey, but is combineByKey executed locally?
I understand that functions such as reduceByKey or foldByKey aren't executed locally.
Reading the documentation, it looks like combineByKey is equivalent to
You are correct in that the type of messages being sent in that
example is String and so reduceFun must operate on String. Being just
an example, it can do any reasonable combining of messages. How
about a + " " + b?
Or the message could be changed to an Int.
The mapReduceTriplets example above
Can you please post how you overcame this issue?
Are there any best practices for schema design and query creation in Spark
SQL?
Thanks Chris,
That is what I wanted to know :)
A.K.M. Ashrafuzzaman
Lead Software Engineer
NewsCred
(M) 880-175-5592433
Twitter | Blog | Facebook
Check out The Academy, your #1 source
for free content marketing resources
On Mar 2, 2015, at 2:04 AM, Chris Fregly ch...@fregly.com wrote:
hey
What is the architecture of Apache Spark SQL?
Hi,
I think your chances for a satisfying answer would increase dramatically if
you elaborated a bit more on what you actually want to know.
(Holds for any of your last four questions about Spark SQL...)
Tobias
Hi,
I have the edge list below. How can I find the parent path for every vertex?
Example:
Vertex 1 path: 2, 3, 4, 5, 6
Vertex 2 path: 3, 4, 5, 6
Vertex 3 path: 4, 5, 6
Vertex 4 path: 5, 6
Vertex 5 path: 6
Could you please let me know how to do this? (or) Any suggestions?
Source Vertex
I think the simplest answer is that it's not really a separate concept
from the 'reduce' function, because Spark's API is a sort of simpler,
purer form of FP. It is just the same function that can be applied at
many points in an aggregation -- both map side (a la Combiners in
MapReduce) or reduce
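A small sketch of that correspondence (assuming an RDD[(String, Int)] named
pairs; the names and data are hypothetical):

  // reduceByKey applies the same function map-side (like a MapReduce Combiner)
  // before the shuffle and again on the reduce side
  val sums = pairs.reduceByKey(_ + _)

  // combineByKey exposes the same machinery explicitly:
  val sums2 = pairs.combineByKey(
    (v: Int) => v,                  // createCombiner, run locally per partition
    (c: Int, v: Int) => c + v,      // mergeValue, local (map-side) combining
    (c1: Int, c2: Int) => c1 + c2)  // mergeCombiners, applied after the shuffle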
There are some “hidden” APIs potentially addressing your problem (but with a
bit of complexity): by using the Actor Receiver, you can tell the supervisor
of the actor receiver to create another actor receiver for you; the ActorRef
of the newly created actor will be sent to the caller of the API (in
Have you already tried using the Vertica hadoop input format with spark? I
don't know how it's implemented, but I'd hope that it has some notion of
vertica-specific shard locality (which JdbcRDD does not).
If you're really constrained to consuming the result set in a single
thread, whatever
You can make a new StreamingContext on an existing SparkContext, I believe?
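A quick sketch of that (assuming sc is the existing SparkContext and a
hypothetical 10-second batch interval):

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(10))
  // after ssc.stop(stopSparkContext = false), a new StreamingContext
  // can be built on the same SparkContext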
On Mon, Mar 2, 2015 at 3:01 PM, Tamas Jambor jambo...@gmail.com wrote:
thanks for the reply.
Actually, our main problem is not really about sparkcontext, the problem is
that spark does not allow to create streaming
Sorry, I meant once the stream is started, it's not possible to create new
streams in the existing streaming context, and it's not possible to create a
new streaming context if another one is already running.
So the only feasible option seemed to be to create a new SparkContext for each
stream (tried using
Hi Cody,
Thanks for the reply. Yea, we thought of possibly doing this in a UDX in
Vertica somehow to get the lower level co-operation but its a bit daunting.
We want to do this because there are things we want to do with the
result-set in Spark that are not possible in Vertica. The DStream
aha ok, thanks.
If I create different RDDs from a parent RDD and force evaluation
thread-by-thread, then it should presumably be fine, correct? Or do I need
to checkpoint the child RDDs as a precaution in case they need to be removed
from memory and recomputed?
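A hedged sketch of that precaution (the RDD names and checkpoint path are
hypothetical): persist the parent so the children reuse materialized
partitions, and optionally checkpoint to guard against recomputation.

  import org.apache.spark.storage.StorageLevel

  parentRdd.persist(StorageLevel.MEMORY_AND_DISK)
  sc.setCheckpointDir("hdfs:///tmp/checkpoints")
  val child = parentRdd.filter(_.nonEmpty)
  child.checkpoint()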
On Sat, Feb 28, 2015 at 4:28 AM,
bq. that 0.1 is always enough?
The answer is: it depends (on use cases).
The value of 0.1 has been validated by several users. I think it is a
reasonable default.
Cheers
On Mon, Mar 2, 2015 at 8:36 AM, Ryan Williams ryan.blake.willi...@gmail.com
wrote:
For reference, the initial version of
Marcelo’s work-around works. So if you are using the itemsimilarity stuff, the
CLI has a way to solve the class-not-found issue, and I can point out how to do
the equivalent if you are using the library API. Ping me if you care.
On Feb 28, 2015, at 2:27 PM, Erlend Hamnaberg erl...@hamnaberg.net
The problem is, you're left with two competing options then. You can
go through the process of deprecating the absolute one and removing it
eventually. You take away the ability to set this value directly though,
meaning you'd have to set absolute values by depending on a % of what
you set your app
-dev +user
No, lambda functions and other code are black-boxes to Spark SQL. If you
want those kinds of optimizations you need to express the columns required
in either SQL or the DataFrame DSL (coming in 1.3).
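A sketch of the distinction (table and column names are hypothetical): a lambda
passed to map() is opaque to the optimizer, whereas SQL or DataFrame
projections let Spark SQL see and prune the required columns.

  val opaque   = schemaRdd.map(r => r.getString(0))        // black box to Catalyst
  val prunable = sqlContext.sql("SELECT name FROM people") // columns visible to Catalyst
  // or, with the 1.3 DataFrame DSL: df.select("name")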
On Mon, Mar 2, 2015 at 1:55 AM, Wail w.alkowail...@cces-kacst-mit.org
wrote:
We have been using Spark SQL in production for our customers at Databricks
for almost a year now. We also know of some very large production
deployments elsewhere. It is still a young project, but I wouldn't call it
alpha.
The primary changes to the API are the addition of the DataFrame
Just a note for whoever writes the doc, spark.executor.extraClassPath
is *prepended* to the executor's classpath, which is a rather
important distinction. :-)
On Fri, Feb 27, 2015 at 12:21 AM, Patrick Wendell pwend...@gmail.com wrote:
I think we need to just update the docs, it is a bit unclear
Hi All,
I am currently having issues reading in a json file using spark sql's api.
Here is what the json file looks like:
{
  "namespace": "spacey",
  "name": "namer",
  "type": "record",
  "fields": [
    {"name": "f1", "type": ["null", "string"]},
    {"name": "f2", "type": ["null", "string"]},
    {"name": "f3", "type": ["null", "string"]},
Is the string of the above JSON object on a single line? jsonFile requires
that every line is a JSON object or an array of JSON objects.
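For example, a hedged sketch of the expected newline-delimited input (the path
and records are hypothetical):

  // each line is one complete, self-contained JSON object:
  // {"name": "alice", "age": 30}
  // {"name": "bob", "age": 25}
  val people = sqlContext.jsonFile("hdfs:///data/people.json")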
On Mon, Mar 2, 2015 at 11:28 AM, kpeng1 kpe...@gmail.com wrote:
Hi All,
I am currently having issues reading in a json file using spark sql's api.
Here is
Hi Todd,
So I am already specifying joda-time-2.7 (have tried 2.2, 2.3, 2.6, 2.7) in
the --jars option. I tried using the joda-time bundle jar (
http://mvnrepository.com/artifact/org.apache.servicemix.bundles/org.apache.servicemix.bundles.joda-time/2.3_1)
which comes with joda-convert.
I know
Hi Colleagues,
Currently we have implemented External Data Source API and are able to push
filters and projections.
Could you provide some info on how perhaps the joins could be pushed to the
original Data Source if both data sources are from the same database?
Briefly looked at
According to Spark SQL Programming Guide:
jsonFile - loads data from a directory of JSON files where each line of the
files is a JSON object.
Note that the file that is offered as jsonFile is not a typical JSON file.
Each line must contain a separate, self-contained valid JSON object. As a