Currently Spark SQL doesn't support the row format/SerDe clause in CTAS. The
workaround is to create the table first.
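A minimal sketch of that workaround through a HiveContext (table, column, and SerDe details below are made up):
// Sketch only: declare the row format on a plain CREATE TABLE, then populate
// it with INSERT ... SELECT instead of using CTAS.
hql("""CREATE TABLE my_table (key STRING, value STRING)
       ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
       STORED AS TEXTFILE""")
hql("INSERT OVERWRITE TABLE my_table SELECT key, value FROM source_table")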
-Original Message-
From: centerqi hu [mailto:cente...@gmail.com]
Sent: Tuesday, September 02, 2014 3:35 PM
To: user@spark.apache.org
Subject: Unsupported language features in
Thanks Cheng Hao
Is there a way to get the list of Hive statements that Spark SQL supports?
Thanks
2014-09-02 15:39 GMT+08:00 Cheng, Hao hao.ch...@intel.com:
Currently Spark SQL doesn't support the row format/SerDe clause in CTAS. The
workaround is to create the table first.
-Original Message-
From:
I am afraid not, but you can report it in JIRA
(https://issues.apache.org/jira/browse/SPARK) if you run into missing
functionality in Spark SQL.
Spark SQL aims to support all of the Hive functionality (or at least most of it)
for the HQL dialect.
-Original Message-
From: centerqi hu
Is there any news about discretization in Spark?
Is there anything on Git? I didn't find it yet.
I want to save a SchemaRDD to Hive.
val usermeta = hql("SELECT userid, idlist FROM usermeta WHERE day='2014-08-01' LIMIT 1000")
case class SomeClass(name: String, idlist: String)
val schemardd = usermeta.map(t => SomeClass(t(0).toString, t(1).toString))
How can I save this SchemaRDD to Hive?
Thanks
--
I guess I found it:
https://github.com/LIDIAgroup/SparkFeatureSelection
I got it:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._
val usermeta = hql("SELECT userid, idlist FROM meta WHERE day='2014-08-01' LIMIT 1000")
case class
You can use saveAsTable, or issue an INSERT statement through Spark SQL if
you need other Hive features, like partitioning.
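A minimal sketch of the saveAsTable route for the SchemaRDD above (assuming Spark 1.1 with a HiveContext in scope and the implicits from import hiveContext._; the target table name is made up):
// Sketch only: the case-class RDD is implicitly converted to a SchemaRDD,
// which can then be written out as a managed Hive table.
case class SomeClass(name: String, idlist: String)
val schemardd = usermeta.map(t => SomeClass(t(0).toString, t(1).toString))
schemardd.saveAsTable("usermeta_sample")   // creates and fills the Hive table
// Alternatively, register it as a table and run an INSERT statement via hql(...).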
On 9/2/14, 6:54 AM, centerqi hu cente...@gmail.com wrote:
I got it
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
val hiveContext =
Thank you very much; this works as well. It seems I still know too little about RDDs.
On September 2, 2014, at 21:22, Silvio Fiorito silvio.fior...@granturing.com wrote:
Once you've registered an RDD as a table, you can use it in Spark SQL
statements:
myrdd.registerAsTable("my_table")
hql("FROM my_table
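A sketch of the full pattern (the target Hive table name below is made up):
myrdd.registerAsTable("my_table")
// Sketch only: the target Hive table is assumed to already exist with
// matching columns.
hql("INSERT OVERWRITE TABLE my_hive_table SELECT name, idlist FROM my_table")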
Trying to bump this -- I'm basically asking if anyone has noticed the
executor leaking memory.
I have a large key space but no churn in the RDD so I don't understand why
memory consumption grows with time.
Any experiences with streaming welcomed -- I'm hoping I'm doing something
wrong
--
Team,
I am new to Apache Spark and I don't have much knowledge of Hadoop or big
data. I need clarification on the following:
How does Spark configuration work? From a tutorial I got the following:
SparkConf conf = new SparkConf().setAppName("Simple application")
I don't have any personal experience with Spark Streaming. Whether you
store your data in HDFS or a database or something else probably depends on
the nature of your use case.
On Fri, Aug 29, 2014 at 10:38 AM, huylv huy.le...@insight-centre.org
wrote:
Hi Daniel,
Your suggestion is definitely
Sorry about the noob question, but I was just wondering if we use Spark's
ActorSystem (SparkEnv.actorSystem), would it distribute actors across
worker nodes or would the actors only run in driver JVM?
Hi Oleg,
If you are running Spark on a yarn cluster, you should set --master to
yarn. By default this runs in client mode, which redirects all output of
your application to your console. This is failing because it is trying to
connect to a standalone master that you probably did not start. I am
Spark-shell, or any other Spark application, does not return the full results
of a job until it has finished executing. You could add a hook for it to
write partial results to a file, but you may want to do so sparingly to
incur fewer I/Os. If you have a large file and the result contains many
lines, it
JavaSparkContext java_SC = new JavaSparkContext(conf); is the spark
context. An application has a single spark context -- you won't be able to
keep calling this -- you'll see an error if you try to create a second
such object from the same application.
Additionally, depending on your
I'm working on setting up Spark on YARN using the HDP technical preview -
http://hortonworks.com/kb/spark-1-0-1-technical-preview-hdp-2-1-3/
I have installed the Spark JARs on all the slave nodes and configured YARN to
find the JARs. It seems like everything is working.
Unless I'm
I've put my Spark JAR into HDFS and set the SPARK_JAR variable to point to
the HDFS location of the JAR. I'm not using any specialized configuration
files (like spark-env.sh), but rather setting things either by environment
variable per node, passing application arguments to the job, or
Hi Greg,
You should not need to even manually install Spark on each of the worker
nodes or put it into HDFS yourself. Spark on Yarn will ship all necessary
jars (i.e. the assembly + additional jars) to each of the containers for
you. You can specify additional jars that your application depends
Thanks. That sounds like how I was thinking it worked. I did have to install
the JARs on the slave nodes for yarn-cluster mode to work, FWIW. It's probably
just whichever node ends up spawning the application master that needs it, but
it wasn't passed along from spark-submit.
Greg
From:
Hi all,
I am getting started with Spark and Mesos. I already have Spark running on
a Mesos cluster and I am able to start the Scala Spark and PySpark shells,
yay! I still have questions on how to distribute third-party Python
libraries, since I want to use things like NLTK and MLlib on PySpark that
Hello,
I’m using Spark streaming to aggregate data from a Kafka topic in sliding
windows. Usually we want to persist this aggregated data to a MongoDB cluster,
or republish to a different Kafka topic. When I include these 3rd party
drivers, I usually get a NotSerializableException due to the
As an update. I'm still getting the same issue. I ended up doing a coalesce
instead of a cache to get around the memory issue but saveAsTextFile still
won't proceed without the coalesce or cache first.
+user@
An executor is specific to an application, but an application can be
executing many jobs at once. So as I understand many jobs' tasks can
be executing at once on an executor.
You may not use your full 80-way parallelism if, for example, your
data set doesn't have 80 partitions. I also
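A small sketch of checking and widening the partition count (the RDD name and the target of 80 are just placeholders):
// Sketch: see how many partitions the RDD really has, and repartition if it
// is fewer than the parallelism you expect.
println(myRdd.partitions.length)
val widened = myRdd.repartition(80)   // incurs a shuffle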
Hello,
Can someone enlighten me on whether calling unpersist on an RDD is
expensive? What is the best way to uncache a cached RDD?
Thanks
Edwin
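For reference, a minimal sketch of the cache/uncache cycle (the data and transform names are hypothetical); unpersist(blocking = false) returns immediately and removes the blocks asynchronously, which keeps the call cheap:
val cached = data.map(expensiveTransform).cache()
cached.count()                        // materializes the cache
// ... reuse `cached` across jobs ...
cached.unpersist(blocking = false)    // drop it from memory without blocking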
Hello all,
after having applied several transformations to a DStream I'd like to
publish all the elements in all the resulting RDDs to Kafka. What would be the
best way to do that? Just using DStream.foreach and then RDD.foreach?
Is there any other built-in utility for this use case?
Thanks a
I'd be interested in finding the answer too. Right now, I do:
val kafkaOutMsgs = kafkInMessages.map(x => myFunc(x._2, someParam))
kafkaOutMsgs.foreachRDD((rdd, time) => { rdd.foreach(rec => {
writer.output(rec) }) }) // where writer.output is a method that takes a
string and writer is an instance of a
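One common variation, sketched below, is to build the writer once per partition on the executor rather than closing over a driver-side instance; this also sidesteps the NotSerializableException mentioned earlier in this digest (MyWriter and someConfig are hypothetical stand-ins for a Kafka producer or similar client):
kafkaOutMsgs.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val writer = new MyWriter(someConfig)      // constructed on the executor
    records.foreach(rec => writer.output(rec))
    writer.close()
  }
}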
Hi All ,
Is it possible to use Cassandra as input data for PySpark? I found an
example for Java -
http://java.dzone.com/articles/sparkcassandra-stack-perform?page=0,0 - and I
am looking for something similar for Python.
Thanks
Oleg.
Hi,
I evaluated the runtime performance of some of the MLlib classification
algorithms on a local machine and a cluster with 10 nodes. I used standalone
mode and Spark 1.0.1 in both cases. Here are the results for the total
runtime:
Local Cluster
Sean,
Thanks for pointing this out. I'd have to experiment with the mapPartitions
method, but you’re right, this seems to address this issue directly. I’m also
connecting to Zookeeper to retrieve SparkConf parameters. I run into the same
issue with my Zookeeper driver, however, this is before
How many iterations are you running? Can you provide the exact details
about the size of the dataset? (how many data points, how many features) Is
this sparse or dense - and for the sparse case, how many non-zeroes? How
many partitions is your data RDD?
For very small datasets the scheduling
PYSPARK_PYTHON may work for you, it's used to specify which Python
interpreter should
be used in both driver and worker. For example, if anaconda was
installed as /anaconda on all the machines, then you can specify
PYSPARK_PYTHON=/anaconda/bin/python to use anaconda virtual
environment in
In Spark 1.1, it is possible to read from Cassandra using Hadoop jobs. See
examples/src/main/python/cassandra_inputformat.py for an example. You may
need to write your own key/value converters.
On Tue, Sep 2, 2014 at 11:10 AM, Oleg Ruchovets oruchov...@gmail.com
wrote:
Hi All ,
Is it
Hi everyone,
We are looking to apply a weight to each training example; this weight should
be used when computing the penalty of a misclassified example. For instance,
without weighting, each example is penalized 1 point when evaluating the model
of a classifier, such as a decision
Make sure your key pair is configured to access whatever region you're
deploying to - it defaults to us-east-1, but you can provide a custom one
with parameter --region.
On Sat, Aug 30, 2014 at 12:53 AM, David Matheson david.j.mathe...@gmail.com
wrote:
I'm following the latest documentation
Num iterations: for LR and SVM, I am using the default value of 100. For all
the other parameters I am also using the default values. I am pretty much
reusing the code from BinaryClassification.scala. For decision trees, I don't
see any parameter for the number of iterations in the example code, so I did
My code seemed to deadlock when I tried to do this:
object MoreRdd extends Serializable {
  def apply(i: Int) = {
    val rdd2 = sc.parallelize(0 to 10)
    rdd2.map(j => i * 10 + j).collect
  }
}
val rdd1 = sc.parallelize(0 to 10)
val y = rdd1.map(i =
Yes, you can't use RDDs inside RDDs. But of course you can do this:
val nums = (0 to 10)
val y = nums.map(i => MoreRdd(i))
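Another way to express the same computation, sketched with the same toy numbers, is to keep everything inside a single RDD so nothing has to be created on the executors:
val rdd1 = sc.parallelize(0 to 10)
val y = rdd1.flatMap(i => (0 to 10).map(j => i * 10 + j)).collect()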
On Tue, Sep 2, 2014 at 10:14 PM, cjwang c...@cjwang.us wrote:
My code seemed deadlock when I tried to do this:
object MoreRdd extends Serializable {
def
I didn't know this restriction. Thank you.
Assume, say, JavaWordCount.
I call the equivalent of a Mapper:
JavaPairRDD<String, Integer> ones = words.mapToPair(...
Now right here I want to guarantee that each word starting with a
particular letter is processed in a specific partition - (Don't tell me
this is a dumb idea - I know that, but in a
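A custom Partitioner is one way to get that guarantee; here is a sketch in Scala (the Java API is analogous, and the letter-to-partition mapping below is just an example):
import org.apache.spark.Partitioner

// Sketch only: send every word that starts with the same letter to the same
// partition; anything else lands in partition 0.
class FirstLetterPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case s: String if s.nonEmpty => math.abs(s.head.toLower - 'a') % numPartitions
    case _                       => 0
  }
}
// val pairs = words.map(w => (w, 1))
// val partitioned = pairs.partitionBy(new FirstLetterPartitioner(26))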
Hi,
I am planning to use an incoming DStream and calculate different measures from
the same stream.
I was able to calculate the individual measures separately, and now I have
to merge them, but Spark Streaming doesn't support outer join yet.
handlingTimePerWorker List(workerId, handlingTime)
Those are interesting numbers. You haven't mentioned the dataset size in
your thread. This is a classic example of scalability and performance,
assuming your baseline numbers are correct and you tuned everything on your
cluster correctly.
Putting on my outsider cap, there are multiple reasons for
Hello friends:
I have a follow-up to Andrew's well articulated answer below (thank you
for that).
(1) I've seen both of these invocations in various places:
(a) '--master yarn'
(b) '--master yarn-client'
the latter of which doesn't appear in
Check out LATERAL VIEW explode:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
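For instance, assuming the JSON has been registered as a table named events with an array column items (names made up), a sketch through a HiveContext:
hql("""SELECT e.id, item
       FROM events e
       LATERAL VIEW explode(e.items) itemTable AS item""")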
On Tue, Sep 2, 2014 at 1:26 PM, gtinside gtins...@gmail.com wrote:
Hi ,
I am using jsonRDD in spark sql and having trouble iterating through array
inside the json object. Please refer
Hi Didata,
(1) Correct. The default deploy mode is `client`, so both masters `yarn`
and `yarn-client` run Spark in client mode. If you explicitly specify
master as `yarn-cluster`, Spark will run in cluster mode. If you implicitly
specify one deploy mode through the master (e.g. yarn-client) but
The dataset is quite small : 5.6 KB. It has 200 rows and 3 features, and 1
column of labels. From this dataset, I split 80% for training set and 20%
for test set. The features are integer counts and labels are binary (1/0).
thanks
It works with spark.executor.extraClassPath -- no exceptions in this case, and
I'm getting the expected results.
But to me this limits/complicates the use of Akka-based receivers a lot. Do you
think it should be considered a bug?
Even if it's not, can it be fixed/worked around by some classloading magic
Hmm... something is fishy here.
That's a *really* small dataset for a spark job, so almost all your time
will be spent in these overheads, but still you should be able to train a
logistic regression model with the default options and 100 iterations in
1s on a single machine.
Are you caching your
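For comparison, a minimal sketch of training with caching enabled (it assumes the data is already an RDD[LabeledPoint]; 100 iterations matches the default discussed above):
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Sketch only: cache before iterating so the 100 passes over the data do not
// re-read or recompute the input each time.
def trainModel(training: RDD[LabeledPoint]) = {
  training.cache()
  LogisticRegressionWithSGD.train(training, 100)
}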
I am not sure if there is a quick fix for this as the actor is started in
the same actorSystem as the Spark's actor system. And since that actor
system is started as soon as the executor is launched, even before the
application code is launched, there isn't much classloader magic that can be
done.
On 08/29/2014 06:05 PM, Nick Chammas wrote:
Here’s a repro for PySpark:
a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(24000)
a.keyBy(lambda x: len(x)).reduceByKey(lambda x, y: x + y).take(1)
When I try this on an EC2 cluster with 1.1.0-rc2 and Python 2.7, this is
what I get:
a =
I'm pretty sure the issue was an interaction with another subsystem. Thanks
for your patience with me!
On Tue, Sep 2, 2014 at 10:05 AM, Sean Owen so...@cloudera.com wrote:
+user@
An executor is specific to an application, but an application can be
executing many jobs at once. So as I
On 09/01/2014 11:39 PM, Kevin Jung wrote:
http://www.adamcrume.com/blog/archive/2014/02/19/fixing-sparks-rdd-zip
Please check this URL.
I got the same problem in v1.0.1.
In some cases, an RDD loses several elements after zip, so
What do you think about using a stream RDD in this case?
Assuming streaming is available for PySpark, you could collect based
on the number of events.
Best,
Matt
On 09/02/2014 10:38 AM, Andrew Or wrote:
Spark-shell, or any other Spark application, does not return the full results
of a job until it has
Thanks. I am not using HiveContext. I am loading data from Cassandra,
converting it into JSON, and then querying it through SQLContext. Can I use
HiveContext to query a jsonRDD?
Gaurav
Hi,
On Wed, Sep 3, 2014 at 6:54 AM, salemi alireza.sal...@udo.edu wrote:
I was able to calculate the individual measures separately, and now I have
to merge them, but Spark Streaming doesn't support outer join yet.
Can't you assign some dummy key (e.g., index) before your processing and
then
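As a sketch of that idea (all names below are invented): key the stream once, derive each measure as a keyed DStream, then inner-join them.
import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations
import org.apache.spark.streaming.dstream.DStream

case class Event(workerId: String, handlingTime: Long, failed: Boolean)

// Sketch only: `events` would be the incoming DStream[Event].
def mergeMeasures(events: DStream[Event]) = {
  val keyed        = events.map(e => (e.workerId, e))
  val handlingTime = keyed.mapValues(_.handlingTime).reduceByKey(_ + _)
  val errorCount   = keyed.mapValues(e => if (e.failed) 1 else 0).reduceByKey(_ + _)
  handlingTime.join(errorCount)   // DStream[(workerId, (totalTime, errors))]
}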
I tried to run KafkaWordCount on a Spark standalone cluster. In this
application, the checkpoint directory was set as follows:
val sparkConf = new SparkConf().setAppName("KafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("checkpoint")
After
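One detail worth noting for a standalone cluster: the checkpoint directory should be on a filesystem that all nodes can reach, such as HDFS, rather than a local relative path (the URI below is made up):
ssc.checkpoint("hdfs://namenode:8020/user/spark/checkpoints/kafka-wordcount")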
Tobias,
That was what I was planning to do, and the technical lead is of the opinion
that we should somehow process each message only once and calculate all the
measures for the worker.
I was wondering if there is a solution out there for that?
Thanks,
Ali
Hi,
On Wed, Sep 3, 2014 at 6:54 AM, salemi
Yes you can. HiveContext's functionality is a strict superset of
SQLContext.
On Tue, Sep 2, 2014 at 6:35 PM, gtinside gtins...@gmail.com wrote:
Thanks. I am not using HiveContext. I am loading data from Cassandra,
converting it into JSON, and then querying it through SQLContext.
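A minimal sketch of switching that flow over to a HiveContext (it assumes the JSON documents are already an RDD[String]; names are invented):
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

// Sketch only: jsonRDD and registerAsTable work on HiveContext as well, and
// hql(...) then gives access to Hive-only constructs such as LATERAL VIEW.
val docs = hiveContext.jsonRDD(cassandraJsonStrings)
docs.registerAsTable("docs")
hql("SELECT * FROM docs LIMIT 10")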
Hi Andrew.
What should I do to set the master to YARN? Can you please point me to the
command or documentation on how to do it?
I am doing the following:
executed start-all.sh
[root@HDOP-B sbin]# ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to
Hi,
I have the following ArrayBuffer:
ArrayBuffer(5,3,1,4)
Now, I want to get the number of elements in this ArrayBuffer and also the
first element of the ArrayBuffer. I used .length and .size but they are
returning 1 instead of 4.
I also used .head and .last for getting the first and the last
Hi ,
I changed my command to:
./bin/spark-submit --master spark://HDOP-B.AGT:7077 --num-executors 3
--driver-memory 4g --executor-memory 2g --executor-cores 1
examples/src/main/python/pi.py 1000
and it fixed the problem.
I still have a couple of questions:
PROCESS_LOCAL is not Yarn
Hi Deep,
Please find below the results for ArrayBuffer in the Scala REPL:
scala> import scala.collection.mutable.ArrayBuffer
import scala.collection.mutable.ArrayBuffer
scala> val a = ArrayBuffer(5,3,1,4)
a: scala.collection.mutable.ArrayBuffer[Int] = ArrayBuffer(5, 3, 1, 4)
scala> a.head
res2: Int = 5
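A sketch of how the rest of that session would look; the size and element accessors all behave as expected on a four-element buffer:
scala> a.length
res3: Int = 4
scala> a.size
res4: Int = 4
scala> a.last
res5: Int = 4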
I just created it.
Here's the ticket:
https://issues.apache.org/jira/browse/SPARK-3364
Thanks,
Kevin
Hi,
I changed the master to point to YARN and got the following exceptions:
[root@HDOP-B spark-1.0.1.2.1.3.0-563-bin-2.4.0.2.1.3.0-563]#
./bin/spark-submit --master yarn://HDOP-M.AGT:8032 --num-executors 3
--driver-memory 4g --executor-memory 2g --executor-cores 1
examples/src/main/python/pi.py 1000
Hi Oleg. To run on YARN, simply set master to yarn. The YARN
configuration, located in a yarn-site.xml, determines where to look for the
YARN ResourceManager.
PROCESS_LOCAL is orthogonal to the choice of cluster resource manager. A
task is considered PROCESS_LOCAL when the executor it's running
I have a problem here.
When I run the commands that Rajesh suggested in the Scala REPL, they work
fine. But I want to do this in Spark code, where I need to find the number
of elements in an ArrayBuffer. In Spark code, these things are not working.
How should I do that?
On Wed, Sep 3, 2014 at