Re: [spark-sql] What is the right way to represent an “Any” type in Spark SQL?

2015-03-28 Thread Michael Armbrust
In this case I'd probably just store it as a String. Our casting rules (which come from Hive) are such that when you use a string as a number or boolean it will be cast to the desired type. Thanks for the PR btw :) On Fri, Mar 27, 2015 at 2:31 PM, Eran Medan ehrann.meh...@gmail.com wrote:
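
For reference, a minimal sketch of that approach (table and column names are hypothetical, not from the thread): the mixed-type field is stored as a String and the Hive-derived coercion rules cast it when it is used as a number; an explicit CAST works the same way if you prefer to spell it out.

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // "value" keeps numbers and booleans as plain strings.
    val df = sc.parallelize(Seq(("a", "1"), ("b", "2.5"), ("c", "true"))).toDF("key", "value")
    df.registerTempTable("events")

    // The string column is coerced to a numeric type by the comparison;
    // rows whose value cannot be cast (e.g. "true") simply drop out as null.
    sqlContext.sql("SELECT key, value FROM events WHERE value > 2").show()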

Re: Add partition support in saveAsParquet

2015-03-28 Thread Michael Armbrust
This is something we are hoping to support in Spark 1.4. We'll post more information to JIRA when there is a design. On Thu, Mar 26, 2015 at 11:22 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, does anyone have a similar request? https://issues.apache.org/jira/browse/SPARK-6561 When we
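
Until that lands, a hedged workaround sketch (column name, path, and layout are assumptions, not from the thread) is to write one Parquet directory per partition value by hand, using the Hive-style key=value directory layout that partition-aware readers expect:

    // Collect the distinct partition values, then write each slice to its own directory.
    val dates = df.select("date").distinct().collect().map(_.getString(0))
    dates.foreach { d =>
      df.filter(df("date") === d)
        .saveAsParquetFile(s"hdfs:///warehouse/events/date=$d")
    }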

Re: Why KMeans with mllib is so slow ?

2015-03-28 Thread davidshen84
Hi Jao, Sorry to pop up this old thread. I am having the same problem you did. I want to know if you have figured out how to improve k-means on Spark. I am using Spark 1.2.0. My data set is about 270k vectors, each with about 350 dimensions. If I set k=500, the job takes about 3hrs on my
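
For context, a minimal sketch of how MLlib k-means is usually invoked (the input path and parameters are assumptions based on the numbers in this thread); caching the parsed vectors matters because k-means makes many passes over the data:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Parse one dense vector per line (comma-separated doubles) and cache before training.
    val data = sc.textFile("hdfs:///data/vectors.csv")
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()

    // k = 500 as in the thread; maxIterations bounds the cost of each run.
    val model = KMeans.train(data, 500, 20)
    println(s"Within-set sum of squared errors: ${model.computeCost(data)}")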

Re: Why KMeans with mllib is so slow ?

2015-03-28 Thread Burak Yavuz
Hi David, Can you also try with Spark 1.3 if possible? I believe there was a 2x improvement on K-Means between 1.2 and 1.3. Thanks, Burak On Sat, Mar 28, 2015 at 9:04 PM, davidshen84 davidshe...@gmail.com wrote: Hi Jao, Sorry to pop up this old thread. I am having the same problem you

Anyone has some simple example with spark-sql with spark 1.3

2015-03-28 Thread Vincent He
I am learning Spark SQL and trying the spark-sql example. I ran the following code but got the exception ERROR CliDriver: org.apache.spark.sql.AnalysisException: cannot recognize input near 'CREATE' 'TEMPORARY' 'TABLE' in ddl statement; line 1 pos 17. I have two questions: 1. Do we have a list of the
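
For reference, a hedged sketch of the data-source DDL form documented in the Spark 1.3 SQL programming guide (the JSON path is the example file shipped with Spark); the same statement can also be issued from Scala via sqlContext.sql if the shell keeps rejecting it:

    // Spark 1.3 data-source DDL issued through the SQLContext.
    sqlContext.sql("""
      CREATE TEMPORARY TABLE people
      USING org.apache.spark.sql.json
      OPTIONS (path "examples/src/main/resources/people.json")
    """)
    sqlContext.sql("SELECT name FROM people").show()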

How to add all combinations of items rated by user and difference between the ratings?

2015-03-28 Thread anishm
The input file is of the format: userid, movieid, rating. From this, I want to extract all possible combinations of movies and the difference between their ratings for each user: (movie1, movie2), (rating(movie1) - rating(movie2)). This should be computed for each user in the dataset. Finally, I
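
A minimal sketch of one way to do this with plain RDD operations (the userid,movieid,rating CSV layout is taken from the post; the path and the choice of groupByKey are assumptions): group the ratings by user, then emit every unordered pair of movies with the rating difference. Note the per-user pair count grows quadratically with the number of movies a user has rated.

    import org.apache.spark.SparkContext._   // pair-RDD operations (needed on older Spark versions)

    // (user, (movie, rating)) pairs parsed from the input file.
    val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
      val Array(user, movie, rating) = line.split(',')
      (user, (movie, rating.toDouble))
    }

    // For each user, all unordered movie pairs and the difference of their ratings.
    val pairDiffs = ratings.groupByKey().flatMap { case (user, rated) =>
      val list = rated.toList
      for {
        (m1, r1) <- list
        (m2, r2) <- list
        if m1 < m2
      } yield (user, (m1, m2), r1 - r2)
    }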

Re: Anyone has some simple example with spark-sql with spark 1.3

2015-03-28 Thread Ted Yu
Please take a look at https://spark.apache.org/docs/latest/sql-programming-guide.html Cheers On Mar 28, 2015, at 5:08 AM, Vincent He vincent.he.andr...@gmail.com wrote: I am learning Spark SQL and trying the spark-sql example. I ran the following code but got the exception ERROR CliDriver:

Custom edge partitioning in graphX

2015-03-28 Thread arpp
Hi all, I am working with Spark 1.0.0, mainly for GraphX, and wish to apply some custom partitioning strategies to the edge list of the graph. I have generated an edge list file which has the partition number after the source and destination id on each line. Initially I am loading
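
For reference, a hedged sketch of a custom strategy (names and the path are hypothetical). Note that Graph.partitionBy only hands the strategy the two vertex ids, so the rule must be a function of (src, dst); if the partition number only exists in the edge file, the usual alternative is to pre-partition the edge RDD by that number before building the graph and skip partitionBy entirely.

    import org.apache.spark.graphx._

    // Assign every edge to a partition derived from its source vertex id.
    object SourceHashPartition extends PartitionStrategy {
      override def getPartition(src: VertexId, dst: VertexId, numParts: PartitionID): PartitionID =
        ((src % numParts + numParts) % numParts).toInt
    }

    val graph = GraphLoader.edgeListFile(sc, "hdfs:///graph/edges.txt")
    val partitioned = graph.partitionBy(SourceHashPartition)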

Re: Anyone has some simple example with spark-sql with spark 1.3

2015-03-28 Thread Ted Yu
See https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html I haven't tried the SQL statements in the above blog myself. Cheers On Sat, Mar 28, 2015 at 5:39 AM, Vincent He vincent.he.andr...@gmail.com wrote: thanks for your information. I have read it; I can run

Re: Anyone has some simple example with spark-sql with spark 1.3

2015-03-28 Thread Vincent He
thanks for your information. I have read it; I can run the samples with Scala or Python, but for the spark-sql shell I cannot get an example running successfully. Can you give me an example I can run with ./bin/spark-sql without writing any code? thanks On Sat, Mar 28, 2015 at 7:35 AM, Ted Yu

Re: Can't access file in spark, but can in hadoop

2015-03-28 Thread Ted Yu
Thanks for the follow-up, Dale. bq. hdp 2.3.1 Minor correction: should be hdp 2.1.3 Cheers On Sat, Mar 28, 2015 at 2:28 AM, Johnson, Dale daljohn...@ebay.com wrote: Actually I did figure this out eventually. I’m running on a Hortonworks cluster hdp 2.3.1 (hadoop 2.4.1). Spark bundles

input size too large | Performance issues with Spark

2015-03-28 Thread nsareen
Hi All, I'm facing performance issues with my Spark implementation and was briefly investigating the WebUI logs. I noticed that my RDD size is 55 GB, the Shuffle Write is 10 GB, and the Input Size is 200 GB. The application is a web application which does predictive analytics, so we keep most of our data in

[Spark Streaming] Disk not being cleaned up during runtime after RDD being processed

2015-03-28 Thread Nathan Marin
Hi, I’ve been trying to use Spark Streaming for my real-time analysis application using the Kafka Stream API on a cluster (using the yarn version) of 6 executors with 4 dedicated cores and 8192 MB of dedicated RAM. The thing is, my application should run 24/7 but the disk usage is leaking. This

Re: RDD resiliency -- does it keep state?

2015-03-28 Thread Steve Loughran
It's worth adding that there's no guarantee that re-evaluated work would be on the same host as before, and in the case of node failure, it is not guaranteed to be elsewhere. This means things that depend on host-local information are going to generate different numbers even if there are no

Re: Spark-submit not working when application jar is in hdfs

2015-03-28 Thread rrussell25
Hi, did you resolve this issue or just work around it by keeping your application jar local? Running into the same issue with 1.3.

Re: Spark-submit not working when application jar is in hdfs

2015-03-28 Thread Ted Yu
Looking at SparkSubmit#addJarToClasspath(): uri.getScheme match { case "file" | "local" => ... case _ => printWarning(s"Skip remote jar $uri.") } It seems the hdfs scheme is not recognized. FYI On Thu, Feb 26, 2015 at 6:09 PM, dilm dmend...@exist.com wrote: I'm trying to run a

Re: Understanding Spark Memory distribution

2015-03-28 Thread Wisely Chen
Hi Ankur, If your hardware is OK, it looks like a config problem. Can you show me the config of spark-env.sh or the JVM config? Thanks Wisely Chen 2015-03-28 15:39 GMT+08:00 Ankur Srivastava ankur.srivast...@gmail.com: Hi Wisely, I have 26 GB for the driver and the master is running on m3.2xlarge

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-28 Thread Michael Stone
I've also been having trouble running 1.3.0 on HDP. The spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041 configuration directive seems to work with pyspark, but does not propagate when using spark-shell. (That is, everything works fine with pyspark, and spark-shell fails with the bad
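
For what it's worth, the workaround usually suggested for this HDP issue is to set the hdp.version flag for both the driver and the YARN AM so that it reaches every JVM involved, e.g. in conf/spark-defaults.conf (the version string is copied from this thread; treat the exact lines as a hedged sketch, not a verified fix for this report):

    spark.driver.extraJavaOptions   -Dhdp.version=2.2.0.0-2041
    spark.yarn.am.extraJavaOptions  -Dhdp.version=2.2.0.0-2041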

Re: RDD resiliency -- does it keep state?

2015-03-28 Thread Aaron Davidson
Note that speculation is off by default to avoid these kinds of unexpected issues. On Sat, Mar 28, 2015 at 6:21 AM, Steve Loughran ste...@hortonworks.com wrote: It's worth adding that there's no guarantee that re-evaluated work would be on the same host as before, and in the case of node

Re: RDD resiliency -- does it keep state?

2015-03-28 Thread Michal Klos
Got it, thanks. Making sure everything is idempotent is definitely a critical piece for peace of mind. On Sat, Mar 28, 2015 at 1:47 PM, Aaron Davidson ilike...@gmail.com wrote: Note that speculation is off by default to avoid these kinds of unexpected issues. On Sat, Mar 28, 2015 at 6:21 AM,

Re: rdd.toDF().saveAsParquetFile(tachyon://host:19998/test)

2015-03-28 Thread Yin Huai
You are hitting https://issues.apache.org/jira/browse/SPARK-6330. It has been fixed in 1.3.1, which will be released soon. On Fri, Mar 27, 2015 at 10:42 PM, sud_self 852677...@qq.com wrote: Spark version is 1.3.0 with tachyon-0.6.1 QUESTION DESCRIPTION:

Re: k-means can only run on one executor with one thread?

2015-03-28 Thread Xi Shen
I have put more detail about my problem at http://stackoverflow.com/questions/29295420/spark-kmeans-computation-cannot-be-distributed I would really appreciate it if you could help me take a look at this problem. I have tried various settings and ways to load/partition my data, but I just cannot get rid

Spark - Hive Metastore MySQL driver

2015-03-28 Thread ๏̯͡๏
Could someone please share the spark-submit command that shows the MySQL jar containing the driver class used to connect to the Hive MySQL metastore? Even after including it through --driver-class-path /home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar OR (AND) --jars
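
For comparison, a hedged sketch of the commonly suggested shape of the command (the application class and jar are placeholders; the connector path is the one quoted in this thread): ship the connector with --jars and also put it on the driver and executor classpaths. The thread reports this still failing, so this is only a sketch of what is being attempted.

    ./bin/spark-submit --master yarn-cluster \
      --jars /home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar \
      --driver-class-path /home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar \
      --conf spark.executor.extraClassPath=mysql-connector-java-5.1.34.jar \
      --class com.example.MyApp myapp.jar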

Re: Spark - Hive Metastore MySQL driver

2015-03-28 Thread ๏̯͡๏
This is from my Hive installation -sh-4.1$ ls /apache/hive/lib | grep derby derby-10.10.1.1.jar derbyclient-10.10.1.1.jar derbynet-10.10.1.1.jar -sh-4.1$ ls /apache/hive/lib | grep datanucleus datanucleus-api-jdo-3.2.6.jar datanucleus-core-3.2.10.jar datanucleus-rdbms-3.2.9.jar

Re: Understanding Spark Memory distribution

2015-03-28 Thread Wisely Chen
Hi, With broadcast, Spark will collect the whole 3 GB object onto the master node and broadcast it to each slave. It is a very common situation that the master node doesn't have enough memory. What are your master node settings? Wisely Chen Ankur Srivastava ankur.srivast...@gmail.com wrote on Saturday, 28 March 2015: I
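
A minimal sketch of the point being made (variable names and the helper are hypothetical): the object handed to sc.broadcast has to exist in full on the driver before it is shipped, so the driver/master JVM needs heap headroom well above the ~3 GB being broadcast here.

    // Built or collected on the driver first, then shipped to the executors.
    val lookup: Map[Long, String] = buildLookupTable()   // hypothetical helper returning ~3 GB of data
    val lookupBc = sc.broadcast(lookup)

    // records: RDD[Long] of ids to enrich (hypothetical).
    val enriched = records.map(id => lookupBc.value.getOrElse(id, "unknown"))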

Re: Can spark sql read existing tables created in hive

2015-03-28 Thread ๏̯͡๏
Yes, I am using yarn-cluster and I did add it via --files. I get a 'suitable driver not found' error. Please share the spark-submit command that shows the MySQL jar containing the driver class used to connect to the Hive MySQL metastore. Even after including it through --driver-class-path

Re: Spark - Hive Metastore MySQL driver

2015-03-28 Thread ๏̯͡๏
This is what I am seeing: ./bin/spark-submit -v --master yarn-cluster --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar --jars

Re: Spark - Hive Metastore MySQL driver

2015-03-28 Thread ๏̯͡๏
I tried with a different version of the driver but got the same error: ./bin/spark-submit -v --master yarn-cluster --driver-class-path

Re: Can't access file in spark, but can in hadoop

2015-03-28 Thread Johnson, Dale
Actually I did figure this out eventually. I’m running on a Hortonworks cluster hdp 2.3.1 (hadoop 2.4.1). Spark bundles the org/apache/hadoop/hdfs/… classes along with the spark-assembly jar. This turns out to introduce a small incompatibility with hdp 2.3.1. I carved these classes out of

Re: k-means can only run on one executor with one thread?

2015-03-28 Thread Reza Zadeh
How many dimensions does your data have? The size of the k-means model is k * d, where d is the dimension of the data. Since you're using k=1000, if your data has dimension higher than, say, 10,000, you will have trouble, because k*d doubles have to fit in the driver. Reza On Sat, Mar 28, 2015
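
For a rough sense of scale with the numbers that come up in this thread (k = 1000, d ≈ 360): k * d = 1000 * 360 = 360,000 doubles, roughly 2.9 MB at 8 bytes each, so the model itself is far below the point where fitting it in the driver becomes the bottleneck.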

Re: k-means can only run on one executor with one thread?

2015-03-28 Thread Xi Shen
My vector dimension is around 360. The data count is about 270k. My driver has 2.9 GB of memory. I attached a screenshot of the current executor status. I submitted this job with --master yarn-cluster. I have a total of 7 worker nodes, one of which acts as the driver. In the screenshot, you can see all