RE: Unsupported language features in query

2014-09-02 Thread Cheng, Hao
Currently SparkSQL doesn’t support the row format/serde in CTAS. The workaround is to create the table first. -Original Message- From: centerqi hu [mailto:cente...@gmail.com] Sent: Tuesday, September 02, 2014 3:35 PM To: user@spark.apache.org Subject: Unsupported language features in
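
For reference, a minimal sketch of that workaround, assuming the Spark 1.0-era HiveContext API, a spark-shell `sc`, and made-up table/column names: declare the table (with its row format) first, then populate it with a plain INSERT ... SELECT rather than CTAS.

```scala
// Hedged sketch: table names, columns, and delimiter are illustrative only.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

// 1. Create the table with the desired ROW FORMAT up front.
hql("""CREATE TABLE IF NOT EXISTS target_table (userid STRING, idlist STRING)
       ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'""")

// 2. Populate it with INSERT ... SELECT instead of CREATE TABLE ... AS SELECT.
hql("INSERT OVERWRITE TABLE target_table SELECT userid, idlist FROM source_table")
```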

Re: Unsupported language features in query

2014-09-02 Thread centerqi hu
Thanks Cheng Hao. Is there a way to obtain the list of Hive statements that Spark supports? Thanks 2014-09-02 15:39 GMT+08:00 Cheng, Hao hao.ch...@intel.com: Currently SparkSQL doesn’t support the row format/serde in CTAS. The workaround is to create the table first. -Original Message- From:

RE: Unsupported language features in query

2014-09-02 Thread Cheng, Hao
I am afraid not, but you can report it in Jira (https://issues.apache.org/jira/browse/SPARK) if you encounter missing functionality in SparkSQL. SparkSQL aims to support all of the Hive functionality (or at least most of it) for the HQL dialect. -Original Message- From: centerqi hu

New features (Discretization) for v1.x in xiangrui.pdf

2014-09-02 Thread filipus
Is there any news about Discretization in Spark? Is there anything on git? I didn't find anything yet. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/New-features-Discretization-for-v1-x-in-xiangrui-pdf-tp13256.html Sent from the Apache Spark User List mailing list

save schemardd to hive

2014-09-02 Thread centerqi hu
I want to save a SchemaRDD to Hive: val usermeta = hql("SELECT userid, idlist FROM usermeta WHERE day='2014-08-01' LIMIT 1000") case class SomeClass(name: String, idlist: String) val schemardd = usermeta.map(t => SomeClass(t(0).toString, t(1).toString)) How do I save the SchemaRDD to Hive? Thanks --

Re: New features (Discretization) for v1.x in xiangrui.pdf

2014-09-02 Thread filipus
i guess i found it https://github.com/LIDIAgroup/SparkFeatureSelection -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/New-features-Discretization-for-v1-x-in-xiangrui-pdf-tp13256p13261.html Sent from the Apache Spark User List mailing list archive at

Re: save schemardd to hive

2014-09-02 Thread centerqi hu
I got it: import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql._ val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) import hiveContext._ val usermeta = hql("SELECT userid, idlist FROM meta WHERE day='2014-08-01' LIMIT 1000") case class

Re: save schemardd to hive

2014-09-02 Thread Silvio Fiorito
You can use saveAsTable, or do a SparkSQL INSERT statement in case you need other Hive query features, like partitioning. On 9/2/14, 6:54 AM, centerqi hu cente...@gmail.com wrote: I got it: import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql._ val hiveContext =

Re: save schemardd to hive

2014-09-02 Thread centerqi
Thank you very much, this also works; it seems that I know too little about RDDs. On Sep 2, 2014, at 21:22, Silvio Fiorito silvio.fior...@granturing.com wrote: Once you’ve registered an RDD as a table, you can use it in SparkSQL statements: myrdd.registerAsTable("my_table") hql("FROM my_table
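
Pulling the thread together, a minimal sketch under the same assumptions as the snippets above (Spark 1.0/1.1-era API, a spark-shell `sc`, illustrative table names): build the SchemaRDD, register it as a table, then write it out either with a SparkSQL INSERT or with saveAsTable.

```scala
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._  // brings hql and the case-class-to-SchemaRDD conversion into scope

case class SomeClass(name: String, idlist: String)
val usermeta = hql("SELECT userid, idlist FROM usermeta WHERE day='2014-08-01' LIMIT 1000")
val schemardd = usermeta.map(t => SomeClass(t(0).toString, t(1).toString))

// Option 1: register the RDD as a table, then INSERT into an existing Hive table.
schemardd.registerAsTable("my_table")
hql("INSERT OVERWRITE TABLE usermeta_copy SELECT * FROM my_table")

// Option 2: saveAsTable writes it out as a new Hive-managed table (Spark 1.1+).
schemardd.saveAsTable("usermeta_copy2")
```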

Re: [Streaming] Cannot get executors to stay alive

2014-09-02 Thread Yana
Trying to bump this -- I'm basically asking if anyone has noticed the executor leaking memory. I have a large key space but no churn in the RDD so I don't understand why memory consumption grows with time. Any experiences with streaming welcomed -- I'm hoping I'm doing something wrong --

Spark Java Configuration.

2014-09-02 Thread pcsenthil
Team, I am new to Apache Spark and I don't have much knowledge of Hadoop or big data. I need clarification on the below: how does Spark configuration work? From a tutorial I got the following: SparkConf conf = new SparkConf().setAppName("Simple application")

Re: Where to save intermediate results?

2014-09-02 Thread Daniel Siegmann
I don't have any personal experience with Spark Streaming. Whether you store your data in HDFS or a database or something else probably depends on the nature of your use case. On Fri, Aug 29, 2014 at 10:38 AM, huylv huy.le...@insight-centre.org wrote: Hi Daniel, Your suggestion is definitely

Using Spark's ActorSystem for performing analytics using Akka

2014-09-02 Thread Aniket Bhatnagar
Sorry about the noob question, but I was just wondering if we use Spark's ActorSystem (SparkEnv.actorSystem), would it distribute actors across worker nodes or would the actors only run in driver JVM?

Re: pyspark yarn got exception

2014-09-02 Thread Andrew Or
Hi Oleg, If you are running Spark on a yarn cluster, you should set --master to yarn. By default this runs in client mode, which redirects all output of your application to your console. This is failing because it is trying to connect to a standalone master that you probably did not start. I am

Re: Spark-shell return results when the job is executing?

2014-09-02 Thread Andrew Or
Spark-shell, or any other Spark application, returns the full results of the job only after it has finished executing. You could add a hook for it to write partial results to a file, but you may want to do so sparingly to incur fewer I/Os. If you have a large file and the result contains many lines, it

Re: Spark Java Configuration.

2014-09-02 Thread Yana Kadiyska
JavaSparkContext java_SC = new JavaSparkContext(conf); is the spark context. An application has a single spark context -- you won't be able to keep calling this -- you'll see an error if you try to create a second such object from the same application. Additionally, depending on your
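
A minimal sketch of that point, shown in Scala for consistency with the rest of this digest (the Java API is analogous): build one SparkConf, create exactly one context from it, and reuse that context for the whole application.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("Simple application")
val sc = new SparkContext(conf)  // create once; a second context in the same application fails
// ... run all of your jobs against this single sc ...
sc.stop()                        // shut it down when the application is done
```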

Spark on YARN question

2014-09-02 Thread Greg Hill
I'm working on setting up Spark on YARN using the HDP technical preview - http://hortonworks.com/kb/spark-1-0-1-technical-preview-hdp-2-1-3/ I have installed the Spark JARs on all the slave nodes and configured YARN to find the JARs. It seems like everything is working. Unless I'm

Re: Spark on YARN question

2014-09-02 Thread Matt Narrell
I’ve put my Spark JAR into HDFS, and specify the SPARK_JAR variable to point to the HDFS location of the jar. I’m not using any specialized configuration files (like spark-env.sh), but rather setting things either by environment variable per node, passing application arguments to the job, or

Re: Spark on YARN question

2014-09-02 Thread Andrew Or
Hi Greg, You should not even need to manually install Spark on each of the worker nodes or put it into HDFS yourself. Spark on Yarn will ship all necessary jars (i.e. the assembly + additional jars) to each of the containers for you. You can specify additional jars that your application depends

Re: Spark on YARN question

2014-09-02 Thread Greg Hill
Thanks. That sounds like how I was thinking it worked. I did have to install the JARs on the slave nodes for yarn-cluster mode to work, FWIW. It's probably just whichever node ends up spawning the application master that needs it, but it wasn't passed along from spark-submit. Greg From:

Spark on Mesos: Pyspark python libraries

2014-09-02 Thread Daniel Rodriguez
Hi all, I am getting started with Spark and Mesos. I already have Spark running on a Mesos cluster and I am able to start the Scala Spark and PySpark shells, yay! I still have questions on how to distribute 3rd-party Python libraries, since I want to use things like NLTK and MLlib in PySpark that

Serialized 3rd party libs

2014-09-02 Thread Matt Narrell
Hello, I’m using Spark streaming to aggregate data from a Kafka topic in sliding windows. Usually we want to persist this aggregated data to a MongoDB cluster, or republish to a different Kafka topic. When I include these 3rd party drivers, I usually get a NotSerializableException due to the

Re: saveAsTextFile makes no progress without caching RDD

2014-09-02 Thread jerryye
As an update. I'm still getting the same issue. I ended up doing a coalesce instead of a cache to get around the memory issue but saveAsTextFile still won't proceed without the coalesce or cache first. -- View this message in context:

Re: Possible to make one executor be able to work on multiple tasks simultaneously?

2014-09-02 Thread Sean Owen
+user@ An executor is specific to an application, but an application can be executing many jobs at once. So as I understand it, many jobs' tasks can be executing at once on an executor. You may not use your full 80-way parallelism if, for example, your data set doesn't have 80 partitions. I also

Regarding function unpersist on rdd

2014-09-02 Thread Zijing Guo
Hello, Can someone enlighten me as to whether calling unpersist on an RDD is expensive? What is the best way to uncache a cached RDD? Thanks Edwin

Publishing a transformed DStream to Kafka

2014-09-02 Thread Massimiliano Tomassi
Hello all, after having applied several transformations to a DStream I'd like to publish all the elements in all the resulting RDDs to Kafka. What would be the best way to do that? Just using DStream.foreach and then RDD.foreach? Is there any other built-in utility for this use case? Thanks a

Re: Publishing a transformed DStream to Kafka

2014-09-02 Thread Tim Smith
I'd be interested in finding the answer too. Right now, I do: val kafkaOutMsgs = kafkInMessages.map(x => myFunc(x._2, someParam)) kafkaOutMsgs.foreachRDD((rdd, time) => { rdd.foreach(rec => { writer.output(rec) }) }) // where writer.output is a method that takes a string and writer is an instance of a
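
A cleaned-up, self-contained sketch of the pattern Tim describes (Spark Streaming 1.x; the transformation and `sendToKafka` are stand-ins, not real APIs): transform the DStream, then push each record out inside foreachRDD.

```scala
import org.apache.spark.streaming.dstream.DStream

// Stand-in for whatever Kafka producer wrapper you actually use.
def sendToKafka(record: String): Unit = println(s"would publish: $record")

def publish(kafkaInMessages: DStream[(String, String)]): Unit = {
  val kafkaOutMsgs = kafkaInMessages.map { case (_, value) => value.reverse } // your transformation
  kafkaOutMsgs.foreachRDD { rdd =>
    rdd.foreach(rec => sendToKafka(rec)) // runs on the executors, record by record
  }
}
```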

pyspark and cassandra

2014-09-02 Thread Oleg Ruchovets
Hi All , Is it possible to have cassandra as input data for PySpark. I found example for java - http://java.dzone.com/articles/sparkcassandra-stack-perform?page=0,0 and I am looking something similar for python. Thanks Oleg.

mllib performance on cluster

2014-09-02 Thread SK
Hi, I evaluated the runtime performance of some of the MLlib classification algorithms on a local machine and a cluster with 10 nodes. I used standalone mode and Spark 1.0.1 in both cases. Here are the results for the total runtime: Local Cluster

Re: Serialized 3rd party libs

2014-09-02 Thread Matt Narrell
Sean, Thanks for pointing this out. I’d have to experiment with the mapPartitions method, but you’re right, this seems to address this issue directly. I’m also connecting to Zookeeper to retrieve SparkConf parameters. I run into the same issue with my Zookeeper driver, however, this is before
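
A hedged sketch of the per-partition idea mentioned above (mapPartitions, or its action counterpart foreachPartition): construct the non-serializable third-party client inside the partition function, on the executor, so it never has to be serialized from the driver. `MongoLikeClient` is a made-up stand-in, not a real driver API.

```scala
import org.apache.spark.rdd.RDD

// Made-up stand-in for a non-serializable 3rd-party client.
class MongoLikeClient {
  def insert(doc: String): Unit = println(s"inserting $doc")
  def close(): Unit = ()
}

def persistAggregates(aggregated: RDD[String]): Unit =
  aggregated.foreachPartition { docs =>
    val client = new MongoLikeClient() // created on the executor, once per partition
    try docs.foreach(client.insert)
    finally client.close()
  }
```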

Re: mllib performance on cluster

2014-09-02 Thread Evan R. Sparks
How many iterations are you running? Can you provide the exact details about the size of the dataset? (how many data points, how many features) Is this sparse or dense - and for the sparse case, how many non-zeroes? How many partitions is your data RDD? For very small datasets the scheduling

Re: Spark on Mesos: Pyspark python libraries

2014-09-02 Thread Davies Liu
PYSPARK_PYTHON may work for you, it's used to specify which Python interpreter should be used in both driver and worker. For example, if anaconda was installed as /anaconda on all the machines, then you can specify PYSPARK_PYTHON=/anaconda/bin/python to use anaconda virtual environment in

Re: pyspark and cassandra

2014-09-02 Thread Kan Zhang
In Spark 1.1, it is possible to read from Cassandra using Hadoop jobs. See examples/src/main/python/cassandra_inputformat.py for an example. You may need to write your own key/value converters. On Tue, Sep 2, 2014 at 11:10 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi All , Is it

MLLib decision tree: Weights

2014-09-02 Thread Sameer Tilak
Hi everyone, We are looking to apply a weight to each training example; this weight should be used when computing the penalty of a misclassified example. For instance, without weighting, each example is penalized 1 point when evaluating the model of a classifier, such as a decision

Re: spark-ec2 [Errno 110] Connection time out

2014-09-02 Thread Daniil Osipov
Make sure your key pair is configured to access whatever region you're deploying to - it defaults to us-east-1, but you can provide a custom one with parameter --region. On Sat, Aug 30, 2014 at 12:53 AM, David Matheson david.j.mathe...@gmail.com wrote: I'm following the latest documentation

Re: mllib performance on cluster

2014-09-02 Thread SK
Num iterations: For LR and SVM, I am using the default value of 100. For all the other parameters I am also using the default values. I am pretty much reusing the code from BinaryClassification.scala. For the decision tree, I don't see any parameter for the number of iterations in the example code, so I did

Creating an RDD in another RDD causes deadlock

2014-09-02 Thread cjwang
My code seemed to deadlock when I tried to do this: object MoreRdd extends Serializable { def apply(i: Int) = { val rdd2 = sc.parallelize(0 to 10) rdd2.map(j => i*10 + j).collect } } val rdd1 = sc.parallelize(0 to 10) val y = rdd1.map(i =>

Re: Creating an RDD in another RDD causes deadlock

2014-09-02 Thread Sean Owen
Yes, you can't use RDDs inside RDDs. But of course you can do this: val nums = (0 to 10) val y = nums.map(i => MoreRdd(i)).collect On Tue, Sep 2, 2014 at 10:14 PM, cjwang c...@cjwang.us wrote: My code seemed to deadlock when I tried to do this: object MoreRdd extends Serializable { def
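
A self-contained restatement of that fix (assumes a live SparkContext `sc`, e.g. in spark-shell; `sc` is passed explicitly here for clarity): the outer map is a plain Scala collection on the driver, so each element launches an ordinary Spark job and no RDD is ever used inside another RDD.

```scala
import org.apache.spark.SparkContext

object MoreRdd extends Serializable {
  def apply(sc: SparkContext, i: Int): Array[Int] =
    sc.parallelize(0 to 10).map(j => i * 10 + j).collect()
}

val nums = 0 to 10                     // plain Scala range on the driver
val y = nums.map(i => MoreRdd(sc, i))  // one Spark job per element; no nested RDDs
```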

Re: Creating an RDD in another RDD causes deadlock

2014-09-02 Thread cjwang
I didn't know this restriction. Thank you. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Creating-an-RDD-in-another-RDD-causes-deadlock-tp13302p13304.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

I am looking for a Java sample of a Partitioner

2014-09-02 Thread Steve Lewis
Assume, say, JavaWordCount. I call the equivalent of a Mapper: JavaPairRDD<String, Integer> ones = words.mapToPair(... Now right here I want to guarantee that each word starting with a particular letter is processed in a specific partition - (Don't tell me this is a dumb idea - I know that, but in a
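
A sketch of the idea in Scala (the original request is for Java, where org.apache.spark.Partitioner is subclassed the same way and applied with JavaPairRDD.partitionBy): a custom Partitioner that sends every word to a partition determined by its first letter.

```scala
import org.apache.spark.Partitioner
import org.apache.spark.SparkContext._  // pair-RDD operations such as partitionBy (Spark 1.x)
import org.apache.spark.rdd.RDD

class FirstLetterPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case s: String if s.nonEmpty => s.toLowerCase.charAt(0).toInt % numPartitions
    case _                       => 0
  }
}

// Usage sketch: `ones` is the (word, 1) pair RDD from a word-count-style job.
def partitionByLetter(ones: RDD[(String, Int)]): RDD[(String, Int)] =
  ones.partitionBy(new FirstLetterPartitioner(26))
```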

Spark Streaming - how to implement multiple calculation using the same data set

2014-09-02 Thread salemi
Hi, I am planning to use an incoming DStream and calculate different measures from the same stream. I was able to calculate the individual measures separately and now I have to merge them, but Spark Streaming doesn't support outer join yet. handlingtimePerWorker List(workerId, handlingTime)

Re: mllib performance on cluster

2014-09-02 Thread Bharath Mundlapudi
Those are interesting numbers. You haven't mentioned the dataset size in your thread. This is a classic example of scalability and performance, assuming your baseline numbers are correct and you have tuned everything correctly on your cluster. Putting on my outside cap, there are multiple reasons for

Re: Spark on YARN question

2014-09-02 Thread Dimension Data, LLC.
Hello friends: I have a follow-up to Andrew's well-articulated answer below (thank you for that). (1) I've seen both of these invocations in various places: (a) '--master yarn' (b) '--master yarn-client' the latter of which doesn't appear in

Re: flattening a list in spark sql

2014-09-02 Thread Michael Armbrust
Check out LATERAL VIEW explode: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView On Tue, Sep 2, 2014 at 1:26 PM, gtinside gtins...@gmail.com wrote: Hi , I am using jsonRDD in spark sql and having trouble iterating through array inside the json object. Please refer
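
A hedged sketch of what that looks like from a HiveContext (table and column names are illustrative, and the JSON data is assumed to already be registered as a table): LATERAL VIEW explode turns each element of the array column into its own row.

```scala
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Each element of the `items` array in `json_docs` becomes a separate row.
val flattened = hiveContext.hql("""
  SELECT doc_id, item
  FROM json_docs
  LATERAL VIEW explode(items) itemTable AS item
""")
flattened.collect().foreach(println)
```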

Re: Spark on YARN question

2014-09-02 Thread Andrew Or
Hi Didata, (1) Correct. The default deploy mode is `client`, so both masters `yarn` and `yarn-client` run Spark in client mode. If you explicitly specify master as `yarn-cluster`, Spark will run in cluster mode. If you implicitly specify one deploy mode through the master (e.g. yarn-client) but

Re: mllib performance on cluster

2014-09-02 Thread SK
The dataset is quite small: 5.6 KB. It has 200 rows and 3 features, and 1 column of labels. From this dataset, I split 80% for the training set and 20% for the test set. The features are integer counts and labels are binary (1/0). thanks -- View this message in context:

RE: [Streaming] Akka-based receiver with messages defined in uploaded jar

2014-09-02 Thread Anton Brazhnyk
It works with spark.executor.extraClassPath – no exceptions in this case and I’m getting the expected results. But to me it limits/complicates the usage of Akka-based receivers a lot. Do you think it should be considered a bug? Even if it’s not, can it be fixed/worked around by some classloading magic

Re: mllib performance on cluster

2014-09-02 Thread Evan R. Sparks
Hmm... something is fishy here. That's a *really* small dataset for a spark job, so almost all your time will be spent in these overheads, but still you should be able to train a logistic regression model with the default options and 100 iterations in 1s on a single machine. Are you caching your

Re: [Streaming] Akka-based receiver with messages defined in uploaded jar

2014-09-02 Thread Tathagata Das
I am not sure if there is a quick fix for this, as the actor is started in the same ActorSystem as Spark's actor system. And since that actor system is started as soon as the executor is launched, even before the application code is launched, there isn't much classloader magic that can be done.

Re: [PySpark] large # of partitions causes OOM

2014-09-02 Thread Matthew Farrellee
On 08/29/2014 06:05 PM, Nick Chammas wrote: Here’s a repro for PySpark: a = sc.parallelize(["Nick", "John", "Bob"]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x, y: x + y).take(1) When I try this on an EC2 cluster with 1.1.0-rc2 and Python 2.7, this is what I get: a =

Re: Possible to make one executor be able to work on multiple tasks simultaneously?

2014-09-02 Thread Victor Tso-Guillen
I'm pretty sure the issue was an interaction with another subsystem. Thanks for your patience with me! On Tue, Sep 2, 2014 at 10:05 AM, Sean Owen so...@cloudera.com wrote: +user@ An executor is specific to an application, but an application can be executing many jobs at once. So as I

Re: zip equal-length but unequally-partition

2014-09-02 Thread Matthew Farrellee
On 09/01/2014 11:39 PM, Kevin Jung wrote: http://www.adamcrume.com/blog/archive/2014/02/19/fixing-sparks-rdd-zip Please check this URL. I got the same problem in v1.0.1. In some cases, the RDD loses several elements after zip, so

Re: Spark-shell return results when the job is executing?

2014-09-02 Thread Matthew Farrellee
What do you think about using a stream RDD in this case? Assuming streaming is available for PySpark, you could collect based on the number of events. Best, Matt. On 09/02/2014 10:38 AM, Andrew Or wrote: Spark-shell, or any other Spark application, returns the full results of the job only after it has

Re: flattening a list in spark sql

2014-09-02 Thread gtinside
Thanks. I am not using a HiveContext. I am loading data from Cassandra, converting it into JSON, and then querying it through a SQLContext. Can I use a HiveContext to query a jsonRDD? Gaurav -- View this message in context:

Re: Spark Streaming - how to implement multiple calculation using the same data set

2014-09-02 Thread Tobias Pfeiffer
Hi, On Wed, Sep 3, 2014 at 6:54 AM, salemi alireza.sal...@udo.edu wrote: I was able to calculate the individual measures separately and now I have to merge them, but Spark Streaming doesn't support outer join yet. Can't you assign some dummy key (e.g., index) before your processing and then
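
A minimal sketch of that suggestion under assumed names (workerId as the shared key, Spark Streaming 1.x): compute each measure as a keyed DStream, then join them back together on that key.

```scala
import org.apache.spark.streaming.StreamingContext._  // pair-DStream operations such as join (Spark 1.x)
import org.apache.spark.streaming.dstream.DStream

// Both measures are keyed by workerId, so an ordinary (inner) join merges them.
def mergeMeasures(handlingTimePerWorker: DStream[(String, Double)],
                  countPerWorker: DStream[(String, Long)]): DStream[(String, (Double, Long))] =
  handlingTimePerWorker.join(countPerWorker)
```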

What is the appropriate privileges needed for writting files into checkpoint directory?

2014-09-02 Thread Tao Xiao
I tried to run KafkaWordCount in a Spark standalone cluster. In this application, the checkpoint directory was set as follows: val sparkConf = new SparkConf().setAppName("KafkaWordCount") val ssc = new StreamingContext(sparkConf, Seconds(2)) ssc.checkpoint("checkpoint") After

Re: Spark Streaming - how to implement multiple calculation using the same data set

2014-09-02 Thread Alireza Salemi
Tobias, That was what I was planning to do, but my technical lead is of the opinion that we should somehow process each message only once and calculate all the measures for the worker. I was wondering if there is a solution out there for that? Thanks, Ali Hi, On Wed, Sep 3, 2014 at 6:54 AM, salemi

Re: flattening a list in spark sql

2014-09-02 Thread Michael Armbrust
Yes you can. HiveContext's functionality is a strict superset of SQLContext. On Tue, Sep 2, 2014 at 6:35 PM, gtinside gtins...@gmail.com wrote: Thanks . I am not using hive context . I am loading data from Cassandra and then converting it into json and then querying it through SQL context .

Re: pyspark yarn got exception

2014-09-02 Thread Oleg Ruchovets
Hi Andrew. What should I do to set the master to yarn? Can you please point me to the command or documentation for how to do it? I am doing the following: executed start-all.sh [root@HDOP-B sbin]# ./start-all.sh starting org.apache.spark.deploy.master.Master, logging to

Number of elements in ArrayBuffer

2014-09-02 Thread Deep Pradhan
Hi, I have the following ArrayBuffer: ArrayBuffer(5,3,1,4). Now, I want to get the number of elements in this ArrayBuffer and also the first element of the ArrayBuffer. I used .length and .size but they are returning 1 instead of 4. I also used .head and .last for getting the first and the last

Re: pyspark yarn got exception

2014-09-02 Thread Oleg Ruchovets
Hi, I changed my command to: ./bin/spark-submit --master spark://HDOP-B.AGT:7077 --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 examples/src/main/python/pi.py 1000 and it fixed the problem. I still have a couple of questions: PROCESS_LOCAL is not Yarn

Re: Number of elements in ArrayBuffer

2014-09-02 Thread Madabhattula Rajesh Kumar
Hi Deep, Please find below the results for ArrayBuffer in the Scala REPL: scala> import scala.collection.mutable.ArrayBuffer import scala.collection.mutable.ArrayBuffer scala> val a = ArrayBuffer(5,3,1,4) a: scala.collection.mutable.ArrayBuffer[Int] = ArrayBuffer(5, 3, 1, 4) scala> a.head res2: Int = 5

Re: zip equal-length but unequally-partition

2014-09-02 Thread Kevin Jung
I just created it. Here's the ticket: https://issues.apache.org/jira/browse/SPARK-3364 Thanks, Kevin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/zip-equal-length-but-unequally-partition-tp13246p13330.html Sent from the Apache Spark User List mailing list

Re: pyspark yarn got exception

2014-09-02 Thread Oleg Ruchovets
Hi, I changed the master to point to yarn and got the following exceptions: [root@HDOP-B spark-1.0.1.2.1.3.0-563-bin-2.4.0.2.1.3.0-563]# ./bin/spark-submit --master yarn://HDOP-M.AGT:8032 --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 examples/src/main/python/pi.py 1000

Re: pyspark yarn got exception

2014-09-02 Thread Sandy Ryza
Hi Oleg. To run on YARN, simply set master to yarn. The YARN configuration, located in a yarn-site.xml, determines where to look for the YARN ResourceManager. PROCESS_LOCAL is orthogonal to the choice of cluster resource manager. A task is considered PROCESS_LOCAL when the executor it's running

Re: Number of elements in ArrayBuffer

2014-09-02 Thread Deep Pradhan
I have a problem here. When I run the commands that Rajesh suggested in the Scala REPL, they work fine. But I want to use them in Spark code, where I need to find the number of elements in an ArrayBuffer. In Spark code, these things are not working. How should I do that? On Wed, Sep 3, 2014 at