Re: LIMIT issue of SparkSQL

2016-10-23 Thread Michael Armbrust
(-dev +user) Can you give more info about the query? Maybe a full explain()? Are you using a data source like JDBC? The API does not currently push down limits, but the documentation describes how you can use a query instead of a table if that is what you are looking to do. On Mon, Oct 24, 20
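
A minimal sketch of the query-instead-of-table pattern Michael mentions, which pushes the LIMIT to the database side; the URL, table name, and credentials below are hypothetical:

    // Assuming an existing SparkSession `spark`.
    val limited = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")          // hypothetical URL
      // Passing a subquery as the "table" makes the database apply the LIMIT,
      // instead of Spark fetching the whole table and limiting afterwards.
      .option("dbtable", "(SELECT * FROM events LIMIT 10) AS t")
      .option("user", "user")
      .option("password", "password")
      .load()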

How to avoid the delay associated with Hive Metastore when loading parquet?

2016-10-23 Thread ankits
Hi, I'm loading Parquet files via Spark, and the first time a file is loaded I see a 5-10s delay caused by the Hive Metastore, with metastore-related messages in the console. How can I avoid this delay and keep the metadata around? I want the data to be persisted even after kill
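
One possible workaround, assuming Hive integration is not actually needed: launch with the in-memory catalog so the Hive Metastore is never consulted, and cache the DataFrame to keep the data and its already-resolved schema around. A sketch, with a hypothetical path:

    // Launched with: spark-submit --conf spark.sql.catalogImplementation=in-memory ...
    val df = spark.read.parquet("/data/events.parquet")  // hypothetical path
    df.cache()   // keep the data (and its resolved schema) in memory
    df.count()   // materialize the cache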

Re: RDD groupBy() then random sort each group ?

2016-10-23 Thread Yang
Thanks, this direction seems to be in line with what I want. What I really want is groupBy() and then, for the rows in each group, to get an Iterator and run each element from the iterator through a local function (specifically SGD). Right now the Dataset API provides this, but it's literally an Iter
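
For reference, the Dataset API being discussed is groupByKey followed by mapGroups, which hands each group to the caller as a plain local Iterator. A minimal sketch; the case class and the per-element arithmetic are hypothetical stand-ins for the real SGD step:

    // Assuming an existing SparkSession `spark`.
    case class Rec(key: String, value: Double)
    import spark.implicits._

    val ds = Seq(Rec("a", 1.0), Rec("a", 2.0), Rec("b", 3.0)).toDS()

    // Each group's rows arrive as an Iterator[Rec], so every element can be
    // streamed through a local routine (e.g. one SGD update per element).
    val result = ds.groupByKey(_.key).mapGroups { (key, rows) =>
      val acc = rows.foldLeft(0.0)((w, r) => w + 0.1 * r.value)  // stand-in for SGD
      (key, acc)
    }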

Re: RDD groupBy() then random sort each group ?

2016-10-23 Thread Yang
Thanks, this is exactly what I ended up doing in the end. Though it seemed to work, there seems to be no guarantee that the randomness after the sortWithinPartitions() would be preserved after I do a further groupBy. On Fri, Oct 21, 2016 at 3:55 PM, Cheng Lian wrote: > I think it would be much easier
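
For context, a sketch of the approach under discussion, assuming a DataFrame df with a "key" column; as noted above, nothing guarantees the per-partition ordering survives a later wide transformation such as groupBy:

    import org.apache.spark.sql.functions.rand
    import spark.implicits._   // for the $"..." column syntax

    // Co-locate each key in one partition, then order its rows randomly.
    val shuffled = df
      .repartition($"key")
      .sortWithinPartitions($"key", rand())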

Re: PostgreSQL queries vs Spark SQL

2016-10-23 Thread Selvam Raman
I found it. We can use pivot, which is similar to CROSSTAB in Postgres. Thank you. On Oct 17, 2016 10:00 PM, "Selvam Raman" wrote: > Hi, > > Please share some ideas if you have worked on this before. > How can I implement the Postgres CROSSTAB function in Spark? > > Postgres example > > Example 1: > > SEL
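
A minimal sketch of pivot playing the role of Postgres CROSSTAB; the column names are hypothetical:

    import org.apache.spark.sql.functions.sum

    // One output row per rowid, one column per distinct category value,
    // cells filled by the aggregate -- the same shape CROSSTAB produces.
    val crosstab = df
      .groupBy("rowid")
      .pivot("category")
      .agg(sum("value"))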

Spark submit running spark-sql-perf and an additional jar

2016-10-23 Thread Mr rty ff
Hi, I run the following script: /home/spark-2.0.1-bin-hadoop2.7/bin/spark-submit --conf "someconf" --jars /home/user/workspace/auxdriver/target/auxdriver.jar,/media/sf_VboxShared/tpc-ds/spark-sql-perf-v.0.2.4/spark-sql-perf-assembly-0.2.4.jar --benchmark DatabasePerformance --iterations 1 --spark

Random forest classifier error: Size exceeds Integer.MAX_VALUE

2016-10-23 Thread Kürşat Kurt
Hi, I am trying to train a random forest classifier. I have a predefined classification set (classifications.csv, ~300,000 lines). While fitting, I am getting a "Size exceeds Integer.MAX_VALUE" error. Here is the code: object Test1 { var savePath = "c:/Temp/SparkModel/" var stemme
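
For context, this error is typically raised when a single cached or shuffled block grows past ~2 GB (the Int-indexed buffer limit), and the usual mitigation is splitting the data into more, smaller partitions. A hedged sketch; the variable names and partition count are illustrative:

    // More partitions keep each individual block well under the ~2 GB limit.
    val repartitioned = trainingData.repartition(200)  // tune for your data size
    val model = randomForestClassifier.fit(repartitioned)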

Spark streaming crashes with high throughput

2016-10-23 Thread Jeyhun Karimov
Hi, I am getting a *Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.* error with a Spark Streaming job. I am using Spark 2.0.0. The job is a simple windowed aggregation and the stream is read from a socket. Average t
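
Since the error text points at containers exceeding thresholds, one common mitigation on YARN is leaving more headroom for off-heap memory; a sketch with illustrative values (spark.yarn.executor.memoryOverhead is the Spark 2.0-era name of the setting):

    import org.apache.spark.SparkConf

    // Off-heap usage (netty shuffle buffers, JVM overhead) counts against the
    // YARN container limit, so it needs room beyond the executor heap.
    val conf = new SparkConf()
      .set("spark.executor.memory", "4g")
      .set("spark.yarn.executor.memoryOverhead", "1024")  // in MB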

Re: HashingTF for TF.IDF computation

2016-10-23 Thread Ciumac Sergiu
Thanks Yanbo! On Sun, Oct 23, 2016 at 1:57 PM, Yanbo Liang wrote: > HashingTF was not designed to handle your case; you can try > CountVectorizer, which will keep the original terms as a vocabulary for > retrieval. CountVectorizer will compute a global term-to-index map, > which can be expensive for

Dataflow of Spark/Hadoop in steps

2016-10-23 Thread Or Raz
I would like to know, if I have 100 GB of data and I want to find the most common word, what is actually going on in my cluster (let's say a master node and 6 workers), step by step. What does the master do? Start the MapReduce job, monitor the traffic, and return the result? The same goes for w
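
For concreteness, the classic word count the question is about, annotated with where each step runs; the input path is hypothetical:

    // sc: an existing SparkContext on the driver (master side).
    val counts = sc.textFile("hdfs:///data/corpus.txt")  // workers each read their own splits
      .flatMap(_.split("\\s+"))                          // map side, runs on workers
      .map(word => (word, 1))
      .reduceByKey(_ + _)       // shuffle: partial counts are exchanged between workers
      .sortBy(_._2, ascending = false)

    // The master (driver) only schedules tasks and collects this small result.
    counts.take(1)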

Re: HashingTF for TF.IDF computation

2016-10-23 Thread Yanbo Liang
HashingTF was not designed to handle your case; you can try CountVectorizer, which will keep the original terms as a vocabulary for retrieval. CountVectorizer will compute a global term-to-index map, which can be expensive for a large corpus and carries a risk of OOM. IDF can accept feature vectors gener
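
A minimal sketch of the suggested replacement, assuming docs is a DataFrame with a tokenized "words" column:

    import org.apache.spark.ml.feature.{CountVectorizer, IDF}

    // Unlike HashingTF, CountVectorizer keeps an explicit term-to-index
    // vocabulary, so feature indices can be mapped back to original terms.
    val cvModel = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("rawFeatures")
      .fit(docs)
    val tf = cvModel.transform(docs)

    val tfidf = new IDF()
      .setInputCol("rawFeatures")
      .setOutputCol("features")
      .fit(tf)
      .transform(tf)

    cvModel.vocabulary   // index -> term lookup for retrieval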