Re: heterogeneous cluster setup

2014-12-04 Thread Victor Tso-Guillen
To reiterate, it's very important for Spark's workers to have the same memory available. Think about Spark uniformly chopping up your data and distributing the work to the nodes. The algorithm is not designed to consider that a worker has less memory available than some other worker. On Thu, Dec

Re: Any ideas why a few tasks would stall

2014-12-04 Thread Ankit Soni
I ran into something similar before. 19/20 partitions would complete very quickly, and 1 would take the bulk of the time and shuffle reads/writes. This was because the majority of partitions were empty, and 1 had all the data. Perhaps something similar is going on here - I would suggest taking a

Re: Serializing with Kryo NullPointerException - Java

2014-12-04 Thread Robin Keunen
Using <dependency> <groupId>com.esotericsoftware</groupId> <artifactId>kryo-shaded</artifactId> <version>3.0.0</version> </dependency> instead of <dependency> <groupId>com.esotericsoftware.kryo</groupId> <artifactId>kryo</artifactId>

Re: Any ideas why a few tasks would stall

2014-12-04 Thread Sameer Farooqui
Good point, Ankit. Steve - You can click on the link for '27' in the first column to get a breakdown of how much data is in each of those 116 cached partitions. But really, you also want to understand how much data is in the 4 non-cached partitions, as they may be huge. One thing you can try

Re: Necessity for rdd replication.

2014-12-04 Thread Sameer Farooqui
In general, most use cases don't need the RDD to be replicated in memory multiple times. It would be a rare exception to do this. If it's really expensive (time consuming) to recompute a lost partition, or if the use case is extremely time sensitive, then maybe you could replicate it in memory.
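
A minimal sketch of requesting 2x in-memory replication via the storage level API; the app name, data, and class name here are illustrative, not from the thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ReplicationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("replication-sketch").setMaster("local[2]"))

    // An RDD that is expensive to recompute would be the usual candidate for replication.
    val expensive = sc.parallelize(1 to 1000000).map(x => x.toLong * x)

    // MEMORY_ONLY_2 keeps each cached partition on two executors, trading extra
    // memory for faster recovery if one executor is lost.
    expensive.persist(StorageLevel.MEMORY_ONLY_2)

    println(expensive.count())
    sc.stop()
  }
}
```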

Re: Monitoring Spark

2014-12-04 Thread Sameer Farooqui
Are you running Spark in Local or Standalone mode? In either mode, you should be able to hit port 4040 (to see the Spark Jobs/Stages/Storage/Executors UI) on the machine where the driver is running. However, in local mode, you won't have a Spark Master UI on 7080 or a Worker UI on 7081. You can

map function

2014-12-04 Thread Yifan LI
Hi, I have a RDD like below: (1, (10, 20)) (2, (30, 40, 10)) (3, (30)) … Is there any way to map it to this: (10,1) (20,1) (30,2) (40,2) (10,2) (30,3) … generally, for each element, it might be mapped to multiple. Thanks in advance! Best, Yifan LI

Re: [question]Where can I get the log file

2014-12-04 Thread Prannoy
Hi, You can access your logs in the /spark_home_directory/logs/ directory. cat the files and you will get the logs. Thanks. On Thu, Dec 4, 2014 at 2:27 PM, FFeng [via Apache Spark User List] ml-node+s1001560n20344...@n3.nabble.com wrote: I have written data to the spark log. I get it

Re: MLLIB model export: PMML vs MLLIB serialization

2014-12-04 Thread manish_k
Hi Sourabh, I came across the same problem as you. One workable solution for me was to serialize the parts of the model that can be used again to recreate it. I serialize the RDDs in my model using saveAsObjectFile, with a time stamp attached, in HDFS. My other spark application reads from the latest

Re: Problem creating EC2 cluster using spark-ec2

2014-12-04 Thread Dave Challis
Fantastic, thanks for the quick fix! On 3 December 2014 at 22:11, Andrew Or and...@databricks.com wrote: This should be fixed now. Thanks for bringing this to our attention. 2014-12-03 13:31 GMT-08:00 Andrew Or and...@databricks.com: Yeah this is currently broken for 1.1.1. I will submit a

SchemaRDD partition on specific column values?

2014-12-04 Thread nitin
Hi All, I want to hash partition (and then cache) a schema RDD in such a way that partitions are based on the hash of the values of a column (the ID column in my case). e.g. if my table has an ID column with values 1,2,3,4,5,6,7,8,9 and spark.sql.shuffle.partitions is configured as 3, then there should be 3
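
For SchemaRDDs the planner decides the layout, but as a point of comparison, here is a hedged sketch of the equivalent on a plain pair RDD: key the rows by the ID column and hash-partition into 3 partitions (data and names are made up):

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD operations in Spark 1.x

object HashPartitionById {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hash-partition-sketch").setMaster("local[2]"))

    // (id, payload) pairs; the id drives the partitioning.
    val rows = sc.parallelize((1 to 9).map(id => (id, s"row-$id")))

    // Hash-partition on the key into 3 partitions (mirroring spark.sql.shuffle.partitions = 3)
    // and cache, so later key-based operations reuse the same layout without reshuffling.
    val partitioned = rows.partitionBy(new HashPartitioner(3)).cache()

    partitioned.glom().collect().zipWithIndex.foreach { case (part, i) =>
      println(s"partition $i -> ${part.map(_._1).mkString(",")}")
    }
    sc.stop()
  }
}
```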

Determination of number of RDDs

2014-12-04 Thread Deep Pradhan
Hi, I have a graph and I want to create RDDs equal in number to the nodes in the graph. How can I do that? If I have 10 nodes then I want to create 10 RDDs. Is that possible in GraphX? Like in the C language we have an array of pointers - do we have an array of RDDs in Spark? Can we create such an array and

Re: Determination of number of RDDs

2014-12-04 Thread Ankur Dave
At 2014-12-04 02:08:45 -0800, Deep Pradhan pradhandeep1...@gmail.com wrote: I have a graph and I want to create RDDs equal in number to the nodes in the graph. How can I do that? If I have 10 nodes then I want to create 10 rdds. Is that possible in GraphX? This is possible: you can collect
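
Ankur's reply is truncated in this archive; what follows is only a hedged sketch of one collect-based approach (the toy graph is made up). Note the result is a plain Array[RDD[...]] held on the driver, since an RDD of RDDs is not possible:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

object PerVertexRDDs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("per-vertex-rdds").setMaster("local[2]"))

    // A small 10-node chain graph, just for illustration.
    val vertices = sc.parallelize((1L to 10L).map(id => (id, s"node-$id")))
    val edges = sc.parallelize((1L until 10L).map(src => Edge(src, src + 1, 1)))
    val graph = Graph(vertices, edges)

    // Collect the vertex IDs to the driver, then build one RDD per vertex.
    val ids: Array[Long] = graph.vertices.map(_._1).collect()
    val perVertex: Array[RDD[(Long, String)]] =
      ids.map(id => graph.vertices.filter { case (vid, _) => vid == id })

    println(perVertex.length) // 10 small RDDs, one per vertex
    sc.stop()
  }
}
```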

Re: MLLib: loading saved model

2014-12-04 Thread manish_k
Hi Sameer, Your model recreation should be: val model = new LinearRegressionModel(weights, intercept) As you have already got the weights for the linear regression model using stochastic gradient descent, you just have to use LinearRegressionModel to construct the new model. Another point to notice is that
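
A minimal sketch of that recreation step; the weights and intercept are hard-coded placeholders standing in for the values read back from storage:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionModel

object RebuildModel extends App {
  // Weights and intercept as they might have been read back from HDFS
  // (e.g. via objectFile); hard-coded here for the sketch.
  val weights = Vectors.dense(0.5, -1.2, 3.0)
  val intercept = 0.1

  // Recreate the model directly from its parameters; no retraining needed.
  val model = new LinearRegressionModel(weights, intercept)

  println(model.predict(Vectors.dense(1.0, 2.0, 3.0)))
}
```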

Re: GraphX Pregel halting condition

2014-12-04 Thread Ankur Dave
There's no built-in support for doing this, so the best option is to copy and modify Pregel to check the accumulator at the end of each iteration. This is robust and shouldn't be too hard, since the Pregel code is short and only uses public GraphX APIs. Ankur At 2014-12-03 09:37:01 -0800, Jay

R: map function

2014-12-04 Thread Paolo Platter
Hi, rdd.flatMap( e => e._2.map( i => ( i, e._1))) should work, but I didn't test it so maybe I'm missing something. Paolo Sent from my Windows Phone From: Yifan LI mailto:iamyifa...@gmail.com Sent: 04/12/2014 09:27 To:
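
Spelled out as a small self-contained program (the data mirrors Yifan's example):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FlatMapPairs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("flatmap-pairs").setMaster("local[2]"))

    // (key, values) as in Yifan's example
    val rdd = sc.parallelize(Seq(
      (1, Seq(10, 20)),
      (2, Seq(30, 40, 10)),
      (3, Seq(30))))

    // Emit one (value, key) pair per element of the inner collection.
    val flipped = rdd.flatMap { case (k, vs) => vs.map(v => (v, k)) }

    flipped.collect().foreach(println)
    // (10,1) (20,1) (30,2) (40,2) (10,2) (30,3)
    sc.stop()
  }
}
```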

Example usage of StreamingListener

2014-12-04 Thread Hafiz Mujadid
Hi! Does anybody have a useful example of the StreamingListener interface? When and how can we use this interface to stop streaming when one batch of data is processed? Thanks a lot -- View this message in context:
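
For what it's worth, a hedged sketch of one way this can be done: register a StreamingListener and stop the context from a separate thread once the first batch completes. The socket source and batch interval are placeholders, not from the thread:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

object StopAfterFirstBatch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("stop-after-first-batch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Any input stream will do for the sketch; a socket stream is assumed here.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    // Stop the context once the first batch has been fully processed.
    // The stop is issued from a separate thread to avoid blocking the listener bus.
    ssc.addStreamingListener(new StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        new Thread() {
          override def run(): Unit = ssc.stop(stopSparkContext = true, stopGracefully = true)
        }.start()
      }
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
```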

Re: running Spark Streaming just once and stop it

2014-12-04 Thread Hafiz Mujadid
Hi Kal El! Have you managed to stop streaming after the first iteration? If yes, can you share example code? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/running-Spark-Streaming-just-once-and-stop-it-tp1382p20359.html Sent from the Apache Spark User

Re: MLlib Naive Bayes classifier confidence

2014-12-04 Thread MariusFS
That was it, Thanks. (Posting here so people know it's the right answer in case they have the same need :) ). sowen wrote: Probabilities won't sum to 1 since this expression doesn't incorporate the probability of the evidence, I imagine? It's constant across classes so is usually excluded. It

Re: Example usage of StreamingListener

2014-12-04 Thread Hafiz Mujadid
Thanks Akhil, you are very helpful. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Example-usage-of-StreamingListener-tp20357p20362.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: map function

2014-12-04 Thread Yifan LI
Thanks, Paolo and Mark. :) On 04 Dec 2014, at 11:58, Paolo Platter paolo.plat...@agilelab.it wrote: Hi, rdd.flatMap( e => e._2.map( i => ( i, e._1))) should work, but I didn't test it so maybe I'm missing something. Paolo Sent from my Windows Phone From: Yifan LI

Re: Using sparkSQL to convert a collection of python dictionary of dictionaries to schma RDD

2014-12-04 Thread sahanbull
Hi Davies, Thanks for the reply. The problem is I have empty dictionaries in my field3 as well. It gives me an error: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/root/spark/python/pyspark/sql.py", line 1042, in inferSchema schema = _infer_schema(first)

Re: Does filter on an RDD scan every data item ?

2014-12-04 Thread nsareen
Thanks for the reply! To be honest, I was expecting spark to have some sort of indexing for keys, which would help it locate the keys efficiently. I wasn't using Spark SQL here, but if it helps perform this efficiently, I can try it out. Can you please elaborate how it will be helpful in this

Re: Does filter on an RDD scan every data item ?

2014-12-04 Thread nsareen
I'm not sure sample is what I was looking for. As mentioned in another post above, this is what I'm looking for: 1) My RDD contains this structure: Tuple2<CustomTuple,Double>. 2) Each CustomTuple is a combination of string ids, e.g. CustomTuple.dimensionOne=AE232323

RE: Spark Streaming empty RDD issue

2014-12-04 Thread Shao, Saisai
Hi, According to my knowledge of the current Spark Streaming Kafka connector, I think there's no way for the application user to detect such a failure; this will be handled either by the Kafka consumer with the ZK coordinator, or by the ReceiverTracker in Spark Streaming, so I think you don't need to take care

Re: How to Integrate openNLP with Spark

2014-12-04 Thread Nikhil
Did anyone get a chance to look at this? Please provide some help. Thanks Nikhil -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-Integrate-openNLP-with-Spark-tp20117p20368.html Sent from the Apache Spark User List mailing list archive at

Efficient way to get top K values per key in (key, value) RDD?

2014-12-04 Thread Theodore Vasiloudis
Hello everyone, I was wondering what is the most efficient way for retrieving the top K values per key in a (key, value) RDD. The simplest way I can think of is to do a groupByKey, sort the iterables and then take the top K elements for every key. But reduceByKey is an operation that can be

Market Basket Analysis

2014-12-04 Thread Rohit Pujari
Hello Folks: I'd like to do market basket analysis using spark, what're my options? Thanks, Rohit Pujari Solutions Architect, Hortonworks

Re: Efficient way to get top K values per key in (key, value) RDD?

2014-12-04 Thread Sean Owen
You probably want to use combineByKey, and create an empty min queue for each key. Merge values into the queue if its size is < K. If >= K, only merge the value if it exceeds the smallest element; if so, add it and remove the smallest element. This gives you an RDD of keys mapped to collections of
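
A hedged sketch of that combineByKey recipe, using a bounded per-key min-heap; the data, K, and all names are invented for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD operations in Spark 1.x
import scala.collection.mutable

object TopKPerKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("topk-per-key").setMaster("local[2]"))
    val K = 3

    val data = sc.parallelize(Seq(
      ("a", 5), ("a", 1), ("a", 9), ("a", 7), ("a", 3),
      ("b", 2), ("b", 8), ("b", 4)))

    // Min-heap per key: with the reversed ordering, the smallest of the current
    // top K sits at the head, so it can be evicted cheaply when a larger value arrives.
    val createCombiner = (v: Int) => {
      val heap = mutable.PriorityQueue.empty[Int](Ordering[Int].reverse)
      heap.enqueue(v)
      heap
    }

    val mergeValue = (heap: mutable.PriorityQueue[Int], v: Int) => {
      if (heap.size < K) heap.enqueue(v)
      else if (v > heap.head) { heap.dequeue(); heap.enqueue(v) }
      heap
    }

    val mergeCombiners = (h1: mutable.PriorityQueue[Int], h2: mutable.PriorityQueue[Int]) => {
      h2.foreach(v => mergeValue(h1, v))
      h1
    }

    val topK = data.combineByKey(createCombiner, mergeValue, mergeCombiners)
      .mapValues(_.toList.sorted.reverse)

    topK.collect().foreach(println)  // (a,List(9, 7, 5)), (b,List(8, 4, 2))
    sc.stop()
  }
}
```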

Re: Spark-Streaming: output to cassandra

2014-12-04 Thread Akhil Das
You can use the datastax's Cassandra connector. https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md Thanks Best Regards On Thu, Dec 4, 2014 at 8:21 PM, m.sar...@accenture.com wrote: Hi, I have written the code below which is streaming data from kafka, and
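
A hedged sketch of what the write side can look like with that connector; the keyspace, table, columns, and Cassandra host are assumptions, and the spark-cassandra-connector artifact must be on the classpath:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object SaveToCassandraSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-save-sketch")
      .setMaster("local[2]")
      .set("spark.cassandra.connection.host", "127.0.0.1") // Cassandra contact point

    val sc = new SparkContext(conf)

    // (word, count) pairs to be written; keyspace "test" and table "words"
    // with columns (word text, times int) are assumed to exist already.
    val rows = sc.parallelize(Seq(("kafka", 3), ("spark", 5)))
    rows.saveToCassandra("test", "words", SomeColumns("word", "times"))

    sc.stop()
  }
}
```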

Failed to read chunk exception

2014-12-04 Thread Steve Lewis
I am running a large job using 4000 partitions - after running for four hours on a 16 node cluster it fails with the following message. The errors are in Spark code and seem to point to unreliability at the level of the disk - has anyone seen this and know what is going on and how to fix it? Exception

Re: Spark-Streaming: output to cassandra

2014-12-04 Thread Gerard Maas
I guess he's already doing so, given the 'saveToCassandra' usage. What I don't understand is the question "how do I specify a batch". That doesn't make much sense to me. Could you explain further? -kr, Gerard. On Thu, Dec 4, 2014 at 5:36 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You can

Re: Any ideas why a few tasks would stall

2014-12-04 Thread Steve Lewis
Thanks - I found the same thing. Calling boolean forceShuffle = true; myRDD = myRDD.coalesce(120, forceShuffle); worked - there were 120 partitions, but forcing a shuffle distributes the work. I believe there is a bug in my code causing memory to accumulate as partitions grow in

How can a function get a TaskContext

2014-12-04 Thread Steve Lewis
https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/TaskContext.java has a Java implementation of TaskContext with a very useful method /** * Return the currently active TaskContext. This can be called inside of * user functions to access contextual information about

Stateful mapPartitions

2014-12-04 Thread Akshat Aranya
Is it possible to have some state across multiple calls to mapPartitions on each partition, for instance, if I want to keep a database connection open?

Re: Spark SQL with a sorted file

2014-12-04 Thread Michael Armbrust
I'll add that some of our data formats will actually infer this sort of useful information automatically. Both parquet and cached in-memory tables keep statistics on the min/max value for each column. When you have predicates over these sorted columns, partitions will be eliminated if they can't

Re: Using sparkSQL to convert a collection of python dictionary of dictionaries to schma RDD

2014-12-04 Thread Davies Liu
Which version of Spark are you using? inferSchema() is improved to support empty dict in 1.2+, could you try the 1.2-RC1? Also, you can use applySchema(): from pyspark.sql import * fields = [StructField('field1', IntegerType(), True), StructField('field2', StringType(), True),

Re: Can not see any spark metrics on ganglia-web

2014-12-04 Thread danilopds
I used the command below because I'm using Spark 1.0.2 built with SBT and it worked. SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true SPARK_GANGLIA_LGPL=true sbt/sbt assembly -- View this message in context:

RE: Determination of number of RDDs

2014-12-04 Thread Kapil Malik
Regarding "Can we create such an array and then parallelize it?": parallelizing an array of RDDs - i.e. RDD[RDD[x]] - is not possible. RDD is not serializable. From: Deep Pradhan [mailto:pradhandeep1...@gmail.com] Sent: 04 December 2014 15:39 To: user@spark.apache.org Subject: Determination of

Re: Spark metrics for ganglia

2014-12-04 Thread danilopds
Hello Samudrala, Did you solve this issue about viewing metrics in Ganglia? Because I have the same problem. Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-metrics-for-ganglia-tp14335p20385.html Sent from the Apache Spark User List mailing

Re: Any ideas why a few tasks would stall

2014-12-04 Thread akhandeshi
This did not work for me, that is, rdd.coalesce(200, forceShuffle). Does anyone have ideas on how to distribute your data evenly and co-locate partitions of interest? -- View this message in context:

Re: SQL query in scala API

2014-12-04 Thread Arun Luthra
Is that Spark SQL? I'm wondering if it's possible without spark SQL. On Wed, Dec 3, 2014 at 8:08 PM, Cheng Lian lian.cs@gmail.com wrote: You may do this: table(users).groupBy('zip)('zip, count('user), countDistinct('user)) On 12/4/14 8:47 AM, Arun Luthra wrote: I'm wondering how to

Re: Spark SQL table Join, one task is taking long

2014-12-04 Thread Venkat Subramanian
Hi Cheng, Thank you very much for taking your time and providing a detailed explanation. I tried a few things you suggested and some more things. The ContactDetail table (8 GB) is the fact table and DAgents is the Dim table (500 KB), reverse of what you are assuming, but your ideas still apply.

spark-ec2 Web UI Problem

2014-12-04 Thread Xingwei Yang
Hi Guys: I have successfully installed apache-spark on Amazon EC2 using the spark-ec2 command and I can log in to the master node. Here is the installation message: RSYNC'ing /etc/ganglia to slaves... ec2-54-148-197-89.us-west-2.compute.amazonaws.com Shutting down GANGLIA gmond: [FAILED]

Unable to run applications on clusters on EC2

2014-12-04 Thread Xingwei Yang
I think it is related to my previous questions, but I separate them. In my previous question, I could not connect to the WebUI even though I could log into the cluster without any problem. Also, I tried lynx localhost:8080 and I could get the information about the cluster; I could also use

How to make symbol for one column in Spark SQL.

2014-12-04 Thread Tim Chou
I have tried to use the functions where and filter in SchemaRDD. I have built a class for the tuple/record in the table like this: case class Region(num:Int, str1:String, str2:String) I also successfully created a SchemaRDD. scala> val results = sqlContext.sql("select * from region") results:

How to extend an one-to-one RDD of Spark that can be persisted?

2014-12-04 Thread Peng Cheng
In my project I extend a new RDD type that wraps another RDD and some metadata. The code I use is similar to FilteredRDD implementation: case class PageRowRDD( self: RDD[PageRow], @transient keys: ListSet[KeyLike] = ListSet() ){ override def getPartitions:

Re: How to make symbol for one column in Spark SQL.

2014-12-04 Thread Michael Armbrust
You need to import sqlContext._ On Thu, Dec 4, 2014 at 2:26 PM, Tim Chou timchou@gmail.com wrote: I have tried to use function where and filter in SchemaRDD. I have build class for tuple/record in the table like this: case class Region(num:Int, str1:String, str2:String) I also
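
A compact, self-contained version of the exchange (assuming the Spark 1.1/1.2-style SchemaRDD API; the data is made up), showing where the import goes:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SymbolColumnSketch {
  case class Region(num: Int, str1: String, str2: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("symbol-column").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    // This import brings in the implicit conversions that make the 'column syntax compile.
    import sqlContext._

    val regions = sc.parallelize(Seq(Region(1, "a", "b"), Region(2, "c", "d")))
    regions.registerTempTable("region")   // implicit RDD-to-SchemaRDD conversion from the import

    val results = sql("select * from region")
    results.where('num === 1).collect().foreach(println)
    sc.stop()
  }
}
```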

Re: How to make symbol for one column in Spark SQL.

2014-12-04 Thread Tim Chou
... Thank you! I'm so stupid... This is the only thing I miss in the tutorial...orz Thanks, Tim 2014-12-04 16:49 GMT-06:00 Michael Armbrust mich...@databricks.com: You need to import sqlContext._ On Thu, Dec 4, 2014 at 2:26 PM, Tim Chou timchou@gmail.com wrote: I have tried to use

Loading a large Hbase table into SPARK RDD takes quite long time

2014-12-04 Thread bonnahu
I am trying to load a large Hbase table into SPARK RDD to run a SparkSQL query on the entity. For an entity with about 6 million rows, it takes about 35 seconds to load it into an RDD. Is that expected? Is there any way to shorten the loading process? I have been getting some tips from

Re: SQL query in scala API

2014-12-04 Thread Stéphane Verlet
Disclaimer: I am new at Spark. I did something similar in a prototype which works, but I have not tested it at scale yet. val agg = users.mapValues(_ => 1).aggregateByKey(new CustomAggregation())(CustomAggregation.sequenceOp, CustomAggregation.comboOp) class CustomAggregation() extends

Re: Spark 1.1.0 Can not read snappy compressed sequence file

2014-12-04 Thread Stéphane Verlet
Yes, it is working with this in spark-env.sh export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_HOME/lib/native export

printing mllib.linalg.vector

2014-12-04 Thread debbie
Basic question: What is the best way to loop through one of these and print their components? Convert them to an array? Thanks Deb

Re: Loading a large Hbase table into SPARK RDD takes quite long time

2014-12-04 Thread Jörn Franke
Hi, What is your cluster setup? How much memory do you have? How much space does one row consisting of only the 3 columns consume? Do you run other stuff in the background? Best regards On 04.12.2014 23:57, bonnahu bonn...@gmail.com wrote: I am trying to load a large Hbase table into SPARK

Getting all the results from MySQL table

2014-12-04 Thread gargp
Hi, I am a new Spark and Scala user. I was trying to use JdbcRDD to query a MySQL table. It needs a lower bound and upper bound as parameters, but I want to get all the records from the table in a single query. Is there a way I can do that? -- View this message in context:
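
A hedged sketch of one common workaround: bind the two required '?' placeholders to a key range that covers the whole table and use a single partition. The JDBC URL, credentials, table, and column names below are placeholders, and the upper bound must be at least as large as the biggest id in the table:

```scala
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD

object JdbcAllRows {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jdbc-all-rows").setMaster("local[2]"))

    def createConnection() = {
      Class.forName("com.mysql.jdbc.Driver")
      DriverManager.getConnection("jdbc:mysql://localhost:3306/mydb", "user", "password")
    }

    // JdbcRDD insists on two '?' placeholders. Binding them to a range that is at
    // least as wide as the table's id values makes the query return every row.
    val rows = new JdbcRDD(
      sc,
      createConnection _,
      "SELECT * FROM my_table WHERE id >= ? AND id <= ?",
      0L,           // lower bound: below the smallest id
      100000000L,   // upper bound: assumed to be at least the largest id in the table
      1,            // a single partition, i.e. one query fetching everything
      (rs: ResultSet) => rs.getString("some_column"))

    println(rows.count())
    sc.stop()
  }
}
```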

representing RDF literals as vertex properties

2014-12-04 Thread spr
@ankurdave's concise code at https://gist.github.com/ankurdave/587eac4d08655d0eebf9, responding to an earlier thread (http://apache-spark-user-list.1001560.n3.nabble.com/How-to-construct-graph-in-graphx-tt16335.html#a16355) shows how to build a graph with multiple edge-types (predicates in

Re: Stateful mapPartitions

2014-12-04 Thread Akshat Aranya
I want to have a database connection per partition of the RDD, and then reuse that connection whenever mapPartitions is called, which results in compute being called on the partition. On Thu, Dec 4, 2014 at 11:07 AM, Paolo Platter paolo.plat...@agilelab.it wrote: Could you provide some further

Re: Market Basket Analysis

2014-12-04 Thread Tobias Pfeiffer
Hi, On Thu, Dec 4, 2014 at 11:58 PM, Rohit Pujari rpuj...@hortonworks.com wrote: I'd like to do market basket analysis using spark, what're my options? To do it or not to do it ;-) Seriously, could you elaborate a bit on what you want to know? Tobias

Re: Stateful mapPartitions

2014-12-04 Thread Tobias Pfeiffer
Hi, On Fri, Dec 5, 2014 at 3:56 AM, Akshat Aranya aara...@gmail.com wrote: Is it possible to have some state across multiple calls to mapPartitions on each partition, for instance, if I want to keep a database connection open? If you're using Scala, you can use a singleton object, this will
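
Tobias's singleton-object suggestion might look roughly like this; the JDBC driver, URL, and per-element work are placeholders. Note it yields one connection per executor JVM rather than strictly one per partition:

```scala
import java.sql.{Connection, DriverManager}
import org.apache.spark.{SparkConf, SparkContext}

// One lazily-created connection per JVM (i.e. per executor), reused across
// every mapPartitions call that runs in that executor.
object ConnectionHolder {
  lazy val connection: Connection = {
    // Driver class and JDBC URL are placeholders for whatever database is used.
    Class.forName("org.postgresql.Driver")
    DriverManager.getConnection("jdbc:postgresql://dbhost:5432/mydb", "user", "password")
  }
}

object StatefulMapPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stateful-mappartitions").setMaster("local[2]"))
    val data = sc.parallelize(1 to 100, 4)

    val enriched = data.mapPartitions { iter =>
      val conn = ConnectionHolder.connection // opened once per executor, then reused
      iter.map { x =>
        // ... look something up via conn ...
        x * 2
      }
    }

    println(enriched.count())
    sc.stop()
  }
}
```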

Re: representing RDF literals as vertex properties

2014-12-04 Thread Ankur Dave
At 2014-12-04 16:26:50 -0800, spr s...@yarcdata.com wrote: I'm also looking at how to represent literals as vertex properties. It seems one way to do this is via positional convention in an Array/Tuple/List that is the VD; i.e., to represent height, weight, and eyeColor, the VD could be a

Re: Market Basket Analysis

2014-12-04 Thread Rohit Pujari
Sure, I'm looking to perform frequent item set analysis on POS data set. Apriori is a classic algorithm used for such tasks. Since Apriori implementation is not part of MLLib yet, (see https://issues.apache.org/jira/browse/SPARK-4001) What are some other options/algorithms I could use to

spark assembly jar caused changed on src filesystem error

2014-12-04 Thread Hu, Leo
Hi all when I execute: /spark-1.1.1-bin-hadoop2.4/bin/spark-submit --verbose --master yarn-cluster --class spark.SimpleApp --jars /spark-1.1.1-bin-hadoop2.4/lib/spark-assembly-1.1.1-hadoop2.4.0.jar --executor-memory 1G --num-executors 2

Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
I got the following error during Spark startup (Yarn-client mode): 14/12/04 19:33:58 INFO Client: Uploading resource file:/x/home/jianshuang/spark/spark-latest/lib/datanucleus-api-jdo-3.2.6.jar -

Re: Loading a large Hbase table into SPARK RDD takes quite long time

2014-12-04 Thread bonnahu
Hi, Here is the configuration of the cluster: Workers: 2. For each worker: Cores: 24 Total (0 Used), Memory: 69.6 GB Total (0.0 B Used). For spark.executor.memory, I didn't set it, so it should be the default value 512M. How much space does one row consisting of only the 3 columns consume? The

SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Rahul Bindlish
Is it a limitation that spark does not support more than one case class at a time? Regards, Rahul -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/serialization-issue-in-case-of-case-class-is-more-than-1-tp20334p20415.html Sent from the Apache Spark User

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
Looks like somehow Spark failed to find the core-site.xml in /etc/hadoop/conf. I've already set the following env variables: export YARN_CONF_DIR=/etc/hadoop/conf export HADOOP_CONF_DIR=/etc/hadoop/conf export HBASE_CONF_DIR=/etc/hbase/conf Should I put $HADOOP_CONF_DIR/* in HADOOP_CLASSPATH?

Re: Loading a large Hbase table into SPARK RDD takes quite long time

2014-12-04 Thread bonnahu
Hi Ted, Here is the information about the Regions: http://regionserver1:60030/ - 44 regions, http://regionserver2:60030/ - 39 regions, http://regionserver3:60030/ - 55 regions. -- View this message in context:

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
Actually my HADOOP_CLASSPATH has already been set to include /etc/hadoop/conf/* export HADOOP_CLASSPATH=/etc/hbase/conf/hbase-site.xml:/usr/lib/hbase/lib/hbase-protocol.jar:$(hbase classpath) Jianshi On Fri, Dec 5, 2014 at 11:54 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Looks like

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Tobias Pfeiffer
On Fri, Dec 5, 2014 at 12:53 PM, Rahul Bindlish rahul.bindl...@nectechnologies.in wrote: Is it a limitation that spark does not support more than one case class at a time. What do you mean? I do not have the slightest idea what you *could* possibly mean by "support a case class". Tobias

Window function by Spark SQL

2014-12-04 Thread Dai, Kevin
Hi, ALL How can I group by one column and order by another one, then select the first row for each group (which is just like window function doing) by SparkSQL? Best Regards, Kevin.

SparkContext.textfile() cannot load file using UNC path on windows

2014-12-04 Thread Ningjun Wang
SparkContext.textFile() cannot load a file using a UNC path on Windows. I run the following on Windows XP: val conf = new SparkConf().setAppName("testproj1.ClassificationEngine").setMaster("local") val sc = new SparkContext(conf)

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Rahul Bindlish
Hi Tobias, Thanks for your response. I have created objectfiles [person_obj, office_obj] from csv files [person_csv, office_csv] using case classes [person, office] with the API (saveAsObjectFile). Now I restarted spark-shell and loaded the objectfiles using the API (objectFile). *Once any of one

Re: Window function by Spark SQL

2014-12-04 Thread Cheng Lian
Window functions are not supported yet, but there is a PR for it: https://github.com/apache/spark/pull/2953 On 12/5/14 12:22 PM, Dai, Kevin wrote: Hi, ALL How can I group by one column and order by another one, then select the first row for each group (which is just like window function
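
Until something like that PR is available, one workaround (a sketch with invented column names, not taken from the thread) is to fall back to the RDD API: key by the grouping column and keep the row with the smallest ordering column via reduceByKey:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD operations in Spark 1.x

object FirstRowPerGroup {
  // (group key, ordering column, payload) stands in for the table's columns.
  case class Row(group: String, ts: Long, value: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("first-row-per-group").setMaster("local[2]"))

    val rows = sc.parallelize(Seq(
      Row("a", 3, "x"), Row("a", 1, "y"),
      Row("b", 7, "z"), Row("b", 9, "w")))

    // Keep, for each group, the row with the smallest ordering column.
    // reduceByKey avoids materialising whole groups the way groupByKey would.
    val firstPerGroup = rows
      .map(r => (r.group, r))
      .reduceByKey((r1, r2) => if (r1.ts <= r2.ts) r1 else r2)
      .values

    firstPerGroup.collect().foreach(println)  // Row(a,1,y), Row(b,7,z)
    sc.stop()
  }
}
```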

Re: SchemaRDD partition on specific column values?

2014-12-04 Thread nitin
With some quick googling, I learnt that we can provide DISTRIBUTE BY column_name in Hive QL to distribute data based on a column's values. My question now is: if I use distribute by id, will there be any performance improvements? Will I be able to avoid data movement in the shuffle (Exchange before

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
Looks like the datanucleus*.jar shouldn't appear in the hdfs path in Yarn-client mode. Maybe this patch broke yarn-client. https://github.com/apache/spark/commit/a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53 Jianshi On Fri, Dec 5, 2014 at 12:02 PM, Jianshi Huang jianshi.hu...@gmail.com wrote:

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Tobias Pfeiffer
Rahul, On Fri, Dec 5, 2014 at 1:29 PM, Rahul Bindlish rahul.bindl...@nectechnologies.in wrote: I have created objectfiles [person_obj,office_obj] from csv[person_csv,office_csv] files using case classes[person,office] with API (saveAsObjectFile) Now I restarted spark-shell and load

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
Correction: According to Liancheng, this hotfix might be the root cause: https://github.com/apache/spark/commit/38cb2c3a36a5c9ead4494cbc3dde008c2f0698ce Jianshi On Fri, Dec 5, 2014 at 12:45 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Looks like the datanucleus*.jar shouldn't appear in

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Rahul Bindlish
Tobias, Thanks for the quick reply. Definitely, after a restart the case classes need to be defined again. I have done so, that's why spark is able to load the objectfile [e.g. person_obj] and spark has maintained the serialVersionUID [person_obj]. Next time when I am trying to load another objectfile [e.g.

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Tobias Pfeiffer
Rahul, On Fri, Dec 5, 2014 at 2:50 PM, Rahul Bindlish rahul.bindl...@nectechnologies.in wrote: I have done so thats why spark is able to load objectfile [e.g. person_obj] and spark has maintained serialVersionUID [person_obj]. Next time when I am trying to load another objectfile [e.g.

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Patrick Wendell
Thanks for flagging this. I reverted the relevant YARN fix in Spark 1.2 release. We can try to debug this in master. On Thu, Dec 4, 2014 at 9:51 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: I created a ticket for this: https://issues.apache.org/jira/browse/SPARK-4757 Jianshi On Fri,

Re: How to incrementally compile spark examples using mvn

2014-12-04 Thread MEETHU MATHEW
Hi all, I made some code changes in the mllib project and, as mentioned in the previous mails, I did mvn install -pl mllib. Now when I run a program in examples using run-example, the new code is not executing. Instead the previous code itself is running. But if I do an mvn install on the entire spark

Re: Spark-Streaming: output to cassandra

2014-12-04 Thread m.sarosh
Hi Gerard/Akhil, By "how do I specify a batch" I was trying to ask when the data in the JavaDStream gets flushed into the Cassandra table. I read somewhere that the streaming data gets written to Cassandra in batches. This batch can be of some particular time, or one particular run.

RDD.aggregate?

2014-12-04 Thread ll
Can someone please explain how RDD.aggregate works? I looked at the average example done with aggregate() but I'm still confused about this function... much appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-aggregate-tp20434.html Sent from
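
Here is the usual average example spelled out as a self-contained sketch: the zero value is a (sum, count) pair, seqOp folds one element into a partition-local pair, and combOp merges the pairs produced by different partitions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AggregateAverage {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("aggregate-average").setMaster("local[2]"))
    val nums = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0))

    // zeroValue: (running sum, running count)
    // seqOp:  folds one element into a partition's accumulator
    // combOp: merges the accumulators of two partitions
    val (sum, count) = nums.aggregate((0.0, 0L))(
      (acc, x) => (acc._1 + x, acc._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2))

    println(s"average = ${sum / count}")  // 3.0
    sc.stop()
  }
}
```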

Re: Spark-Streaming: output to cassandra

2014-12-04 Thread Akhil Das
Batch is the batch duration that you specify while creating the StreamingContext, so at the end of every batch's computation the data will get flushed to Cassandra. And why are you stopping your program with Ctrl + C? You can always specify the time with the sc.awaitTermination(Duration)

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Rahul Bindlish
Tobias, Find the csv and scala files; below are the steps: 1. Copy the csv files into the current directory. 2. Open spark-shell from this directory. 3. Run the one_scala file, which will create object-files from the csv-files in the current directory. 4. Restart spark-shell. 5. a. Run the two_scala file; while running it is

Re: Spark executor lost

2014-12-04 Thread Akhil Das
It says connection refused; just make sure the network is configured properly (open the ports between the master and the worker nodes). If the ports are configured correctly, then I assume the process is getting killed for some reason and hence the connection is refused. Thanks Best Regards On Fri, Dec 5,

Re: how do you turn off info logging when running in local mode

2014-12-04 Thread Akhil Das
Yes, there is a way. Just add the following piece of code before creating the SparkContext. import org.apache.log4j.Logger import org.apache.log4j.Level Logger.getLogger("org").setLevel(Level.OFF) Logger.getLogger("akka").setLevel(Level.OFF) Thanks Best Regards On Fri, Dec 5, 2014 at 12:48 AM,

Re: spark-ec2 Web UI Problem

2014-12-04 Thread Akhil Das
It's working: http://ec2-54-148-248-162.us-west-2.compute.amazonaws.com:8080/ If it didn't install correctly, then you could try the spark-ec2 script with *--resume* I think. Thanks Best Regards On Fri, Dec 5, 2014 at 3:11 AM, Xingwei Yang happy...@gmail.com wrote: Hi Guys: I have successfully

Clarifications on Spark

2014-12-04 Thread Ajay
Hello, I work for an eCommerce company. Currently we are looking at building a Data Warehouse platform as described below: DW as a Service | REST API | SQL on NoSQL (Drill/Pig/Hive/Spark SQL) | NoSQL databases (one or more; maybe RDBMS directly too) | (bulk load) MySQL

Re: Auto BroadcastJoin optimization failed in latest Spark

2014-12-04 Thread Jianshi Huang
Sorry for the late of follow-up. I used Hao's DESC EXTENDED command and found some clue: new (broadcast broken Spark build): parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1} old (broadcast working

Re: Auto BroadcastJoin optimization failed in latest Spark

2014-12-04 Thread Jianshi Huang
If I run ANALYZE without NOSCAN, then Hive can successfully get the size: parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417764589, COLUMN_STATS_ACCURATE=true, totalSize=0, numRows=1156, rawDataSize=76296} Is Hive's PARQUET support broken? Jianshi On Fri, Dec 5, 2014 at 3:30

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-04 Thread Tobias Pfeiffer
Rahul, On Fri, Dec 5, 2014 at 3:51 PM, Rahul Bindlish rahul.bindl...@nectechnologies.in wrote: 1. Copy csv files in current directory. 2. Open spark-shell from this directory. 3. Run one_scala file which will create object-files from csv-files in current directory. 4. Restart spark-shell

Re: Auto BroadcastJoin optimization failed in latest Spark

2014-12-04 Thread Jianshi Huang
With Liancheng's suggestion, I've tried setting spark.sql.hive.convertMetastoreParquet to false, but analyze noscan still returns -1 in rawDataSize. Jianshi On Fri, Dec 5, 2014 at 3:33 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: If I run ANALYZE without NOSCAN, then Hive can successfully