spark disk-to-disk

2015-03-22 Thread Koert Kuipers
I would like to use Spark for some algorithms where I make no attempt to work in memory, so read from HDFS and write to HDFS for every step. Of course I would like every step to be evaluated only once. And I have no need for Spark's RDD lineage info, since I persist to reliable storage. The

Re: converting DStream[String] into RDD[String] in spark streaming

2015-03-22 Thread deenar.toraskar
Sean, DStream.saveAsTextFiles internally calls foreachRDD and saveAsTextFile for each interval: def saveAsTextFiles(prefix: String, suffix: String = "") { val saveFunc = (rdd: RDD[T], time: Time) => { val file = rddToFileName(prefix, suffix, time); rdd.saveAsTextFile(file) }
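For completeness, a minimal usage sketch of the built-in method (the HDFS prefix and suffix are placeholders; each batch produces one directory named prefix-TIME_IN_MS.suffix):

    // assuming lines: DStream[String] from an existing StreamingContext
    lines.saveAsTextFiles("hdfs:///tmp/out", "txt")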

Spark sql thrift server slower than hive

2015-03-22 Thread fanooos
We have Cloudera CDH 5.3 installed on one machine. We are trying to use the Spark SQL Thrift Server to execute some analysis queries against a Hive table. Without any changes to the configuration, we run the following query on both Hive and the Spark SQL Thrift Server: select * from tableName; The

Re: netlib-java cannot load native lib in Windows when using spark-submit

2015-03-22 Thread Xi Shen
Hi Ted, I have tried to invoke the command from both cygwin environment and powershell environment. I still get the messages: 15/03/22 21:56:00 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 15/03/22 21:56:00 WARN netlib.BLAS: Failed to load

Re: How to set Spark executor memory?

2015-03-22 Thread Xi Shen
OK, I actually got the answer days ago from StackOverflow, but I did not check it :( When running in local mode, to set the executor memory: when using spark-submit, use --driver-memory; when running as a Java application, e.g. executing from an IDE, set the -Xmx VM option. Thanks, David On
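A minimal sketch of the local-mode setup this refers to (the app name and the 2g figure are examples, not from the thread):

    // give the driver JVM enough heap first: either `--driver-memory 2g` on
    // spark-submit, or `-Xmx2g` in the IDE run configuration -- in local mode
    // the executors live inside that same JVM
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setMaster("local[4]").setAppName("memory-example")
    val sc = new SparkContext(conf)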

DataFrame saveAsTable - partitioned tables

2015-03-22 Thread deenar.toraskar
Hi, I wanted to store DataFrames as partitioned Hive tables. Is there a way to do this via the saveAsTable call? The set of options does not seem to be documented. def saveAsTable(tableName: String, source: String, mode: SaveMode, options: Map[String, String]): Unit (Scala-specific) Creates a

Re: Load balancing

2015-03-22 Thread Jeffrey Jedele
Hi Mohit, please make sure you use the Reply to all button and include the mailing list, otherwise only I will get your message ;) Regarding your question: Yes, that's also my understanding. You can partition streaming RDDs only by time intervals, not by size. So depending on your incoming rate,

Should Spark SQL support retrieve column value from Row by column name?

2015-03-22 Thread amghost
I would like to retrieve column values from a Spark SQL query result. But currently it seems that Spark SQL only supports retrieving by index: val results = sqlContext.sql("SELECT name FROM people") results.map(t => "Name: " + t(0)).collect().foreach(println) I think it will be much more convenient if

Re: Should Spark SQL support retrieve column value from Row by column name?

2015-03-22 Thread Yanbo Liang
If you use the latest version, Spark 1.3, you can use the DataFrame API like: val results = sqlContext.sql("SELECT name FROM people") results.select("name").show() 2015-03-22 15:40 GMT+08:00 amghost zhengweita...@gmail.com: I would like to retrieve column values from a Spark SQL query result. But
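A slightly fuller sketch of this 1.3-era workaround (assuming a registered people table with a name column; extracting the values still falls back to positional access on Row):

    val results = sqlContext.sql("SELECT name FROM people")
    results.select("name").show()

    // pull the values out as Strings by position until by-name getters exist
    results.map(row => row.getString(0)).collect().foreach(println)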

Re: How to do nested foreach with RDD

2015-03-22 Thread Xi Shen
Hi Reza, Yes, I just found RDD.cartesian(). Very useful. Thanks, David On Sun, Mar 22, 2015 at 5:08 PM Reza Zadeh r...@databricks.com wrote: You can do this with the 'cartesian' product method on RDD. For example: val rdd1 = ... val rdd2 = ... val combinations =

Re: converting DStream[String] into RDD[String] in spark streaming

2015-03-22 Thread Sean Owen
On Sun, Mar 22, 2015 at 8:43 AM, deenar.toraskar deenar.toras...@db.com wrote: 1) if there are no sliding window calls in this streaming context, will there be just one file written per interval? As many files as there are partitions will be written in each interval. 2) if there is a sliding
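If one output file per interval is what's wanted, a common workaround (a sketch; collapsing to one partition is only sensible for modest batch sizes, and the path is a placeholder) is to coalesce inside foreachRDD:

    dstream.foreachRDD { (rdd, time) =>
      // a single partition means a single part file per batch directory
      rdd.coalesce(1).saveAsTextFile(s"hdfs:///tmp/out-${time.milliseconds}")
    }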

Re: ArrayIndexOutOfBoundsException in ALS.trainImplicit

2015-03-22 Thread Sabarish Sasidharan
My bad. This was an out-of-memory error disguised as something else. Regards Sab On Sun, Mar 22, 2015 at 1:53 AM, Sabarish Sasidharan sabarish.sasidha...@manthan.com wrote: I am consistently running into this ArrayIndexOutOfBoundsException issue when using trainImplicit. I have tried changing the

How to use DataFrame with MySQL

2015-03-22 Thread gavin zhang
OK, I know that I can use the JDBC connector to create a DataFrame with this command: val jdbcDF = sqlContext.load("jdbc", Map("url" -> "jdbc:mysql://localhost:3306/video_rcmd?user=root&password=123456", "dbtable" -> "video")) But I got this error: java.sql.SQLException: No suitable driver found for ...
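That error usually just means the MySQL JDBC driver jar is not on the classpath. A sketch of one common fix (the jar path is a placeholder, and passing the driver class explicitly via a "driver" option is one thing to try):

    // start the shell/application with the connector on the classpath, e.g.
    //   spark-shell --jars /path/to/mysql-connector-java-5.1.34.jar
    val jdbcDF = sqlContext.load("jdbc", Map(
      "url"     -> "jdbc:mysql://localhost:3306/video_rcmd?user=root&password=123456",
      "dbtable" -> "video",
      "driver"  -> "com.mysql.jdbc.Driver"))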

lowerupperBound not working/spark 1.3

2015-03-22 Thread Marek Wiewiorka
Hi All - I am trying to use the new SQLContext API for populating a DataFrame from a JDBC data source, like this: val jdbcDF = sqlContext.jdbc(url = "jdbc:postgresql://localhost:5430/dbname?user=user&password=111", table = "se_staging.exp_table3", columnName = "cs_id", lowerBound = 1, upperBound = 1,
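For comparison, a sketch of the call with all partitioning arguments spelled out (the bounds and partition count below are made up; lowerBound/upperBound only control how the cs_id range is split across partitions, they do not filter rows):

    val jdbcDF = sqlContext.jdbc(
      "jdbc:postgresql://localhost:5430/dbname?user=user&password=111", // url
      "se_staging.exp_table3",  // table
      "cs_id",                  // partition column
      1L,                       // lowerBound
      5000000L,                 // upperBound, roughly the max cs_id
      10)                       // numPartitions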

Re: How Does aggregate work

2015-03-22 Thread Ted Yu
I assume spark.default.parallelism is 4 in the VM Ashish was using. Cheers

How Does aggregate work

2015-03-22 Thread ashish.usoni
Hi, I am not able to understand how the aggregate function works. Can someone please explain how the result below came about? I am running Spark using the Cloudera VM. The result below is 17, but I am not able to work out how it is calculated. val data = sc.parallelize(List(2,3,4)) data.aggregate(0)((x,y)

Re: Error while installing Spark 1.3.0 on local machine

2015-03-22 Thread Dean Wampler
Any particular reason you're not just downloading a build from http://spark.apache.org/downloads.html ? Even if you aren't using Hadoop, any of those builds will work. If you want to build from source, the Maven build is more reliable. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd
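For reference, a typical from-source build looks something like this (a sketch; add the Hadoop/Hive profiles appropriate to your environment):

    mvn -DskipTests clean package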

How to check that a dataset is sorted after it has been written out? [repost]

2015-03-22 Thread Michael Albert
Greetings! [My apologies for this repost; I'm not certain that the first message made it to the list.] I sorted a dataset in Spark and then wrote it out in Avro/Parquet. Then I wanted to check that it was sorted. It looks like each partition has been sorted, but when reading in, the first

Re: can distinct transform applied on DStream?

2015-03-22 Thread Dean Wampler
aDstream.transform(_.distinct()) will only make the elements of each RDD in the DStream distinct, not for the whole DStream globally. Is that what you're seeing? Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe
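A small sketch of what per-RDD distinct means in practice (the queue-based stream and the spark-shell sc are just for illustration):

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))
    val queue = mutable.Queue[RDD[Int]](sc.parallelize(Seq(1, 1, 2)), sc.parallelize(Seq(2, 3, 3)))
    val stream = ssc.queueStream(queue)

    // each batch is deduplicated on its own: the first batch prints 1, 2 and the
    // second prints 2, 3 -- the repeated 2 survives because distinct is per RDD
    stream.transform(_.distinct()).print()
    ssc.start()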

Re: netlib-java cannot load native lib in Windows when using spark-submit

2015-03-22 Thread Ted Yu
How about pointing LD_LIBRARY_PATH to native lib folder ? You need Spark 1.2.0 or higher for the above to work. See SPARK-1719 Cheers On Sun, Mar 22, 2015 at 4:02 AM, Xi Shen davidshe...@gmail.com wrote: Hi Ted, I have tried to invoke the command from both cygwin environment and powershell

Re: How Does aggregate work

2015-03-22 Thread Dean Wampler
2 is added every time the final partition aggregator is called. The result of summing the elements across partitions is 9, of course. If you force a single partition (using spark-shell in local mode): scala> val data = sc.parallelize(List(2,3,4), 1) scala> data.aggregate(0)((x,y) => x+y, (x,y) => 2+x+y)
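Spelling out the four-partition case that produced 17 (a sketch of the arithmetic, assuming parallelize splits List(2,3,4) into partitions summing to 0, 2, 3 and 4):

    // seqOp (x+y) sums within each partition:  0, 2, 3, 4
    // combOp (2+x+y) then folds those in:      2+0+0 = 2
    //                                          2+2+2 = 6
    //                                          2+6+3 = 11
    //                                          2+11+4 = 17
    val data = sc.parallelize(List(2, 3, 4))                 // 4 partitions in the VM
    data.aggregate(0)((x, y) => x + y, (x, y) => 2 + x + y)  // res: Int = 17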

Re: Load balancing

2015-03-22 Thread Mohit Anchlia
Posting my question again :) Thanks for the pointer. Looking at the description below from the site, it looks like in Spark the block size is not fixed; it's determined by the block interval, and in fact for the same batch you could have different block sizes. Did I get it right? - Another

Re: lowerupperBound not working/spark 1.3

2015-03-22 Thread Ted Yu
From the javadoc of JDBCRelation#columnPartition(): * Given a partitioning schematic (a column of integral type, a number of * partitions, and upper and lower bounds on the column's value), generate In your example, 1 and 1 are for the values of the cs_id column. Looks like all the values in

Re: Should Spark SQL support retrieve column value from Row by column name?

2015-03-22 Thread Michael Armbrust
Please open a JIRA, we added the info to Row that will allow this to happen, but we need to provide the methods you are asking for. I'll add that this does work today in python (i.e. row.columnName). On Sun, Mar 22, 2015 at 12:40 AM, amghost zhengweita...@gmail.com wrote: I would like to

Re: lowerupperBound not working/spark 1.3

2015-03-22 Thread Marek Wiewiorka
...I even tried setting the upper/lower bounds to the same value, like 1 or 10, with the same result. cs_id is a column with cardinality ~5*10^6, so that is not the case here. Regards, Marek 2015-03-22 20:30 GMT+01:00 Ted Yu yuzhih...@gmail.com: From javadoc of JDBCRelation#columnPartition(): *

Re: lowerupperBound not working/spark 1.3

2015-03-22 Thread Ted Yu
I went over JDBCRelation#columnPartition() but didn't find obvious clue (you can add more logging to confirm that the partitions were generated correctly). Looks like the issue may be somewhere else. Cheers On Sun, Mar 22, 2015 at 12:47 PM, Marek Wiewiorka marek.wiewio...@gmail.com wrote:

Re: netlib-java cannot load native lib in Windows when using spark-submit

2015-03-22 Thread Burak Yavuz
Did you build Spark with: -Pnetlib-lgpl? Ref: https://spark.apache.org/docs/latest/mllib-guide.html Burak On Sun, Mar 22, 2015 at 7:37 AM, Ted Yu yuzhih...@gmail.com wrote: How about pointing LD_LIBRARY_PATH to native lib folder ? You need Spark 1.2.0 or higher for the above to work. See
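For reference, the profile is passed at build time, along the lines of (combine with whatever other profiles your build already uses):

    mvn -Pnetlib-lgpl -DskipTests clean package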

Re: DataFrame saveAsTable - partitioned tables

2015-03-22 Thread Michael Armbrust
Note you can use HiveQL syntax for creating dynamically partitioned tables though. On Sun, Mar 22, 2015 at 1:29 PM, Michael Armbrust mich...@databricks.com wrote: Not yet. This is on the roadmap for Spark 1.4. On Sun, Mar 22, 2015 at 12:19 AM, deenar.toraskar deenar.toras...@db.com wrote:
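A rough sketch of the HiveQL route in the meantime (the table and column names are invented; it needs a HiveContext and the usual dynamic-partition settings):

    val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    hc.sql("SET hive.exec.dynamic.partition = true")
    hc.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    hc.sql("CREATE TABLE events_by_day (id INT, payload STRING) PARTITIONED BY (day STRING)")
    // source_events is a temporary table registered from the DataFrame to be saved
    hc.sql("INSERT OVERWRITE TABLE events_by_day PARTITION (day) SELECT id, payload, day FROM source_events")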

Re: DataFrame saveAsTable - partitioned tables

2015-03-22 Thread Michael Armbrust
Not yet. This is on the roadmap for Spark 1.4. On Sun, Mar 22, 2015 at 12:19 AM, deenar.toraskar deenar.toras...@db.com wrote: Hi I wanted to store DataFrames as partitioned Hive tables. Is there a way to do this via the saveAsTable call. The set of options does not seem to be documented.

Re: join two DataFrames, same column name

2015-03-22 Thread Michael Armbrust
You can include * and a column alias in the same select clause: var df1 = sqlContext.sql("select *, column_id AS table1_id from table1") I'm also hoping to resolve SPARK-6376 https://issues.apache.org/jira/browse/SPARK-6376 before Spark 1.3.1, which will let you do something like: var df1 =
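A sketch of how the alias then lets the join disambiguate the shared column (the table and column names are illustrative):

    val df1 = sqlContext.sql("select *, column_id AS table1_id from table1")
    val df2 = sqlContext.table("table2")

    // join on the renamed column so the two column_id columns no longer collide
    val joined = df1.join(df2, df1("table1_id") === df2("column_id"))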

Re: 'nested' RDD problem, advise needed

2015-03-22 Thread Victor Tso-Guillen
Something like this? (2 to alphabetLength toList).map(shift => Object.myFunction(inputRDD, shift).map(v => shift -> v)).foldLeft(Object.myFunction(inputRDD, 1).map(v => 1 -> v))(_ union _) which is an RDD[(Int, Char)]. Problem is that you can't play with RDDs inside of RDDs. The recursive structure

Convert Spark SQL table to RDD in Scala / error: value toFloat is a not a member of Any

2015-03-22 Thread Minnow Noir
I'm following some online tutorial written in Python and trying to convert a Spark SQL table object to an RDD in Scala. The Spark SQL just loads a simple table from a CSV file. The tutorial says to convert the table to an RDD. The Python is products_rdd = sqlContext.table("products").map(lambda

Re: How to do nested foreach with RDD

2015-03-22 Thread Reza Zadeh
You can do this with the 'cartesian' product method on RDD. For example: val rdd1 = ... val rdd2 = ... val combinations = rdd1.cartesian(rdd2).filter { case (a, b) => a < b } Reza On Sat, Mar 21, 2015 at 10:37 PM, Xi Shen davidshe...@gmail.com wrote: Hi, I have two big RDDs, and I need to do
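Filled in with concrete RDDs (a minimal sketch; the < filter keeps each unordered pair once and drops self-pairs):

    val rdd1 = sc.parallelize(1 to 4)
    val rdd2 = sc.parallelize(1 to 4)
    val combinations = rdd1.cartesian(rdd2).filter { case (a, b) => a < b }
    combinations.collect()  // (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), order may vary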

Re: can distinct transform applied on DStream?

2015-03-22 Thread Akhil Das
What do you mean not distinct? It does work for me. Code: import org.apache.spark.streaming.{Seconds, StreamingContext} import org.apache.spark.{SparkContext, SparkConf} val ssc = new StreamingContext(sc, Seconds(1)) val data =

Re: Convert Spark SQL table to RDD in Scala / error: value toFloat is a not a member of Any

2015-03-22 Thread Cheng Lian
You need either .map { row => (row(0).asInstanceOf[Float], row(1).asInstanceOf[Float], ...) } or .map { case Row(f0: Float, f1: Float, ...) => (f0, f1) } On 3/23/15 9:08 AM, Minnow Noir wrote: I'm following some online tutorial written in Python and trying to convert a Spark SQL table
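A slightly fuller sketch of the second form (assuming a products table whose first two columns really are a String and a Float; adjust the types in the pattern to the actual schema, since a non-matching row will throw a MatchError):

    import org.apache.spark.sql.Row

    val products = sqlContext.table("products")
    val productsRdd = products.map {
      case Row(name: String, price: Float) => (name, price)
    }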

Re: Convert Spark SQL table to RDD in Scala / error: value toFloat is a not a member of Any

2015-03-22 Thread Ted Yu
I thought of formation #1. But looks like when there're many fields, formation #2 is cleaner. Cheers On Sun, Mar 22, 2015 at 8:14 PM, Cheng Lian lian.cs@gmail.com wrote: You need either .map { row = (row(0).asInstanceOf[Float], row(1).asInstanceOf[Float], ...) } or .map { case

SocketTimeout only when launching lots of executors

2015-03-22 Thread Tianshuo Deng
Hi, spark users. When running a spark application with lots of executors(300+), I see following failures: java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:152) at

Re: Spark sql thrift server slower than hive

2015-03-22 Thread Denny Lee
How are you running your spark instance out of curiosity? Via YARN or standalone mode? When connecting Spark thriftserver to the Spark service, have you allocated enough memory and CPU when executing with spark? On Sun, Mar 22, 2015 at 3:39 AM fanooos dev.fano...@gmail.com wrote: We have

Re: spark disk-to-disk

2015-03-22 Thread Reynold Xin
On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers ko...@tresata.com wrote: so finally I can resort to: rdd.saveAsObjectFile(...) sc.objectFile(...) but that seems like a rather broken abstraction. This seems like a fine solution to me.
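A sketch of the pattern under discussion, checkpointing each stage to HDFS by hand (the paths and the pair type are placeholders):

    // write the result of one step to reliable storage...
    step1Output.saveAsObjectFile("hdfs:///pipeline/step1")

    // ...then start the next step from a fresh read, dropping the lineage
    val step1Reloaded = sc.objectFile[(String, Int)]("hdfs:///pipeline/step1")
    val step2Output = step1Reloaded.reduceByKey(_ + _)
    step2Output.saveAsObjectFile("hdfs:///pipeline/step2")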