query over hive context hangs, please help

2015-07-21 Thread
The thread dump is here; the query seems to hang while accessing the MySQL metastore. I googled and found a bug related to com.mysql.jdbc.util.ReadAheadInputStream, but it has no workaround, and I am not sure it is the same issue. Please help me, thanks. thread dump --- MyAppDefaultScheduler_Worker-2 prio=10

Re: Java 8 vs Scala

2015-07-15 Thread
I think different teams get different answers to this question. My team uses Scala and is happy with it. On Wed, Jul 15, 2015 at 1:31 PM, Tristan Blakers tris...@blackfrog.org wrote: We have had excellent results operating on RDDs using Java 8 with lambdas. It’s slightly more verbose than Scala,

Re: Re: how to use DoubleRDDFunctions on mllib Vector?

2015-07-09 Thread
/GaussianMixture.scala At 2015-07-09 10:10:58, 诺铁 noty...@gmail.com wrote: thanks, I understand now, but I can't find mllib.clustering.GaussianMixture#vectorMean; what version of Spark do you use? On Thu, Jul 9, 2015 at 1:16 AM, Feynman Liang fli...@databricks.com wrote: A RDD[Double

how to use DoubleRDDFunctions on mllib Vector?

2015-07-07 Thread
hi, there are some useful functions in DoubleRDDFunctions which I can use if I have an RDD[Double], e.g. mean and variance. Vector doesn't have such methods; how can I convert a Vector to an RDD[Double], or, better, call mean directly on a Vector?
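
A minimal sketch of one way to do the conversion the question asks for (names and values are illustrative): distribute the vector's values with sc.parallelize so the implicit DoubleRDDFunctions become available.

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // assumes an existing SparkContext `sc`, e.g. in the spark shell
    val vec: Vector = Vectors.dense(1.0, 2.0, 3.0, 4.0)
    val values = sc.parallelize(vec.toArray)   // RDD[Double]
    println(values.mean())                     // 2.5
    println(values.variance())                 // 1.25

For a single small vector this round-trips local data through the cluster; vec.toArray.sum / vec.size stays local and may be simpler.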

how to do table partitioning efficiently?

2015-06-26 Thread
hi, now I'm doing something like this on a DataFrame to make use of table partitioning: df.filter($"sex" === "male").write.parquet("path/to/table/sex=male") df.filter($"sex" === "female").write.parquet("path/to/table/sex=female") This filters the dataset multiple times; is there a better way to do this?
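
Since Spark 1.4 the DataFrameWriter can do this in one pass; a minimal sketch, assuming a DataFrame df with a sex column:

    // write once; Spark creates one sub-directory per distinct value of `sex`
    df.write
      .partitionBy("sex")
      .parquet("path/to/table")
    // result: path/to/table/sex=male/ and path/to/table/sex=female/,
    // produced from a single scan of the data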

how to create custom data source?

2015-06-24 Thread
hi, I want to use Spark to analyze source code :) Since code has dependencies between lines, it's not possible to just treat it as independent lines. So I am considering providing my own data source for source code, but there isn't much documentation about the data source API; where can I learn how to do this?
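
A minimal sketch of the Spark 1.3+ external data source API (the relation below, which emits one row per whole file so line dependencies stay together, is a hypothetical example, not an existing class):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // entry point Spark looks up when you call sqlContext.read.format("my.pkg")
    class DefaultSource extends RelationProvider {
      override def createRelation(sqlContext: SQLContext,
                                  parameters: Map[String, String]): BaseRelation =
        new SourceCodeRelation(parameters("path"))(sqlContext)
    }

    class SourceCodeRelation(path: String)(@transient val sqlContext: SQLContext)
      extends BaseRelation with TableScan {

      // one row per file, keeping the whole text of the file together
      override def schema: StructType = StructType(Seq(
        StructField("file", StringType, nullable = false),
        StructField("content", StringType, nullable = false)))

      override def buildScan(): RDD[Row] =
        sqlContext.sparkContext.wholeTextFiles(path).map {
          case (file, content) => Row(file, content)
        }
    }

In Spark 1.4+ this would be loaded with sqlContext.read.format("my.pkg").load("path/to/src").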

Re: Does filter on an RDD scan every data item ?

2014-12-07 Thread
there is a *PartitionPruningRDD* :: DeveloperApi :: An RDD used to prune RDD partitions so we can avoid launching tasks on all partitions. An example use case: if we know the RDD is partitioned by range, and the execution DAG has a filter on the key, we can avoid launching tasks on
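
A minimal usage sketch (the pairs RDD and the knowledge that matching keys live only in partition 0 are assumptions for illustration):

    import org.apache.spark.rdd.PartitionPruningRDD

    // `pairs` is range-partitioned so that keys < 100 live only in partition 0
    val pruned = PartitionPruningRDD.create(pairs, partitionIndex => partitionIndex == 0)
    val small = pruned.filter { case (k, _) => k < 100 }.collect()
    // the filter still scans every row of the surviving partition, but no
    // tasks are launched at all for the pruned partitions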

can't get smallint field from hive on spark

2014-11-26 Thread
hi, I don't know whether this question should be asked here; if not, please point me to the right place, thanks. We are currently using Hive on Spark; when reading a smallint field, it reports the error: Cannot get field 'i16Val' because union is currently set to i32Val. I googled and found only the source code of

Re: can't get smallint field from hive on spark

2014-11-26 Thread
to ask (see https://hive.apache.org/mailing_lists.html). Thanks, Yin On Wed, Nov 26, 2014 at 10:49 PM, 诺铁 noty...@gmail.com wrote: thank you very much. On Thu, Nov 27, 2014 at 11:30 AM, Michael Armbrust mich...@databricks.com wrote: This has been fixed in Spark 1.1.1 and Spark 1.2

SparkContext creation slow down unit tests

2014-09-16 Thread
hi, I am trying to write some unit tests, following the Spark programming guide http://spark.apache.org/docs/latest/programming-guide.html#unit-testing, but I observed that the unit tests run very slowly (the code is just a SparkPi), so I turned the log level to trace, looked through the log output, and found
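
For reference, a minimal sketch of the per-suite pattern the guide suggests, assuming ScalaTest: create one local SparkContext before the tests and stop it afterwards, so the context start-up cost is paid once.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.{BeforeAndAfterAll, FunSuite}

    class SparkPiSuite extends FunSuite with BeforeAndAfterAll {
      @transient private var sc: SparkContext = _

      override def beforeAll(): Unit = {
        sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
      }

      override def afterAll(): Unit = {
        if (sc != null) sc.stop()
      }

      test("estimates pi") {
        val n = 100000
        val inside = sc.parallelize(1 to n).filter { _ =>
          val x = math.random * 2 - 1
          val y = math.random * 2 - 1
          x * x + y * y < 1          // point falls inside the unit circle
        }.count()
        assert(math.abs(4.0 * inside / n - math.Pi) < 0.1)
      }
    }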

Re: SparkContext creation slow down unit tests

2014-09-16 Thread
I connected my sample project to a hosted CI service, and it takes only 3 seconds to run there... while the same tests take 2 minutes on my MacBook Pro. So maybe this is a Mac OS specific problem? On Tue, Sep 16, 2014 at 3:06 PM, 诺铁 noty...@gmail.com wrote: hi, I am trying to write some unit tests

Re: SparkContext creation slow down unit tests

2014-09-16 Thread
sorry for the disturbance, please ignore this mail. In the end I found it was slow because of a lack of memory on my machine. Sorry again. On Tue, Sep 16, 2014 at 3:26 PM, 诺铁 noty...@gmail.com wrote: I connected my sample project to a hosted CI service, and it takes only 3 seconds to run there... while the same

Re: how to split RDD by key and save to different path

2014-08-12 Thread
with the same key in the same partition } } 2014-08-11 20:42 GMT+08:00 诺铁 noty...@gmail.com: hi, I have googled and found a similar question without a good answer, http://stackoverflow.com/questions/24520225/writing-to-hadoop-distributed-file-system-multiple-times-with-spark In short, I would like

how to split RDD by key and save to different path

2014-08-11 Thread
hi, I have googled and found a similar question without a good answer, http://stackoverflow.com/questions/24520225/writing-to-hadoop-distributed-file-system-multiple-times-with-spark In short, I would like to separate the raw data, divide it by some key, for example the create date, and put each group in a directory
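
One commonly suggested approach (a sketch using the old Hadoop mapred API; extractCreateDate is a hypothetical function that pulls the date key out of a raw line): subclass MultipleTextOutputFormat so each record's key chooses its output directory, and write everything in a single pass.

    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    class KeyBasedOutputFormat extends MultipleTextOutputFormat[String, String] {
      // e.g. key "2014-08-11" -> file 2014-08-11/part-00000
      override def generateFileNameForKeyValue(key: String, value: String, name: String): String =
        key + "/" + name
      // return null so the key is not repeated inside the file contents
      override def generateActualKey(key: String, value: String): String = null
    }

    def splitAndSave(sc: SparkContext, extractCreateDate: String => String): Unit =
      sc.textFile("path/to/raw")
        .map(line => (extractCreateDate(line), line))
        .saveAsHadoopFile("path/to/output",
          classOf[String], classOf[String], classOf[KeyBasedOutputFormat])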

how to use SPARK_PUBLIC_DNS

2014-08-10 Thread
hi all, I am playing with Docker, trying to create a Spark cluster with Docker containers. Since the Spark master, workers, and driver all need to reach each other, I configured a DNS server and set the hostname and domain name of each node. But when the Spark master starts up, it seems to be using the hostname
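
For reference, a minimal sketch of how these variables are usually set (the hostname and address below are hypothetical): SPARK_PUBLIC_DNS controls the name a daemon advertises to other nodes and in its web UI, while SPARK_LOCAL_IP controls the address it binds to.

    # conf/spark-env.sh inside each container
    export SPARK_PUBLIC_DNS=spark-master.cluster.local
    export SPARK_LOCAL_IP=172.17.0.2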

Re: Save an RDD to a SQL Database

2014-08-07 Thread
I haven't seen people write directly to a SQL database, mainly because it's difficult to deal with failure. What if the network breaks halfway through the process? Should we drop all the data in the database and restart from the beginning? If the process is appending data to the database, then things become even more complex.
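
When people do write to a database, the usual sketch (the URL, table, and record shape here are hypothetical) opens one JDBC connection per partition and batches the inserts; because Spark may re-run a failed task, the statement should be idempotent, e.g. an upsert keyed on id, which is exactly the failure concern raised above.

    import java.sql.DriverManager
    import org.apache.spark.rdd.RDD

    def saveToDatabase(rdd: RDD[(Long, String)], jdbcUrl: String): Unit =
      rdd.foreachPartition { rows =>
        // one connection per partition, not per record
        val conn = DriverManager.getConnection(jdbcUrl)
        try {
          val stmt = conn.prepareStatement("INSERT INTO my_table (id, value) VALUES (?, ?)")
          rows.foreach { case (id, value) =>
            stmt.setLong(1, id)
            stmt.setString(2, value)
            stmt.addBatch()
          }
          stmt.executeBatch()
        } finally {
          conn.close()
        }
      }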

Re: Is any idea on architecture based on Spark + Spray + Akka

2014-05-04 Thread
hello, ZhangYi, I found Ooyala's open-sourced spark-jobserver, https://github.com/ooyala/spark-jobserver; it seems that they are also using Akka, Spray, and Spark, so it may be helpful for you. On Mon, May 5, 2014 at 11:37 AM, ZhangYi yizh...@thoughtworks.com wrote: Hi all, Currently, our project is

confused by reduceByKey usage

2014-04-17 Thread
HI, I am new to Spark; when trying to write some simple tests in the spark shell, I met the following problem. I created a very small text file named 5.txt containing three lines of 1 2 3 4 5, and experimented in the spark shell: scala> val d5 = sc.textFile("5.txt").cache() d5: org.apache.spark.rdd.RDD[String] =

Re: confused by reduceByKey usage

2014-04-17 Thread
instead: scala> d5.keyBy(_.split(" ")(0)).mapValues(_.split(" ")(1).toInt).reduceByKey((v1, v2) => v1 + v2).collect On Thu, Apr 17, 2014 at 6:29 PM, 诺铁 noty...@gmail.com wrote: HI, I am new to Spark; when trying to write some simple tests in the spark shell, I met the following problem. I created a very
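
For a worked run on the 5.txt contents shown above (three lines of 1 2 3 4 5), the fixed pipeline behaves like this (a sketch of what the reply's one-liner computes, not output copied from the thread):

    val d5 = sc.textFile("5.txt").cache()
    val sums = d5.keyBy(_.split(" ")(0))            // ("1", "1 2 3 4 5") for each line
                 .mapValues(_.split(" ")(1).toInt)  // ("1", 2) for each line
                 .reduceByKey((v1, v2) => v1 + v2)  // add up values that share a key
                 .collect()
    // sums: Array[(String, Int)] = Array((1,6))    -- three 2s summed under key "1"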

Re: confused by reduceByKey usage

2014-04-17 Thread
/YARN/Mesos), the output of println goes to executor stdout. On Fri, Apr 18, 2014 at 6:53 AM, 诺铁 noty...@gmail.com wrote: yeah, I got it! Using println to debug is great for me to explore Spark. Thank you very much for your kind help. On Fri, Apr 18, 2014 at 12:54 AM, Daniel Darabos

Re: confused by reduceByKey usage

2014-04-17 Thread
, 诺铁 noty...@gmail.com wrote: hi, Cheng, thank you for letting me know this. So what do you think is a better way to debug? On Fri, Apr 18, 2014 at 9:27 AM, Cheng Lian lian.cs@gmail.com wrote: A tip: using println is only convenient when you are working in local mode. When running Spark