I would like to use Spark for some algorithms where I make no attempt to
work in memory, so I read from HDFS and write to HDFS for every step.
Of course I would like every step to only be evaluated once. And I have no
need for Spark's RDD lineage info, since I persist to reliable storage.
the
Sean
DStream.saveAsTextFiles internally calls foreachRDD and saveAsTextFile for
each interval:
def saveAsTextFiles(prefix: String, suffix: String = "") {
  val saveFunc = (rdd: RDD[T], time: Time) => {
    val file = rddToFileName(prefix, suffix, time)
    rdd.saveAsTextFile(file)
  }
  this.foreachRDD(saveFunc)
}
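For reference, a minimal usage sketch (the DStream name and output path below are made up, not from this thread); each batch interval gets its own output directory named from the prefix, the batch time in milliseconds, and the suffix:

// writes one directory per interval, e.g. hdfs:///out/events-1427000000000.txt
myDStream.saveAsTextFiles("hdfs:///out/events", "txt")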
We have Cloudera CDH 5.3 installed on one machine.
We are trying to use the Spark SQL thrift server to execute some analysis
queries against a Hive table.
Without any changes to the configuration, we run the following query on
both Hive and the Spark SQL thrift server:
*select * from tableName;*
The
Hi Ted,
I have tried to invoke the command from both cygwin environment and
powershell environment. I still get the messages:
15/03/22 21:56:00 WARN netlib.BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeSystemBLAS
15/03/22 21:56:00 WARN netlib.BLAS: Failed to load
OK, I actually got the answer days ago from StackOverflow, but I did not
check it :(
When running in local mode, to set the executor memory:
- when using spark-submit, use --driver-memory
- when running as a Java application, e.g. executing from an IDE, set the
  -Xmx VM option
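A minimal sketch of the same point (the app name and memory sizes are made up, not from this thread): in local mode the "executor" lives inside the driver JVM, so its memory is simply the driver's heap.

import org.apache.spark.{SparkConf, SparkContext}

// spark.executor.memory has no effect in local mode
val conf = new SparkConf().setAppName("MyApp").setMaster("local[*]")
val sc = new SparkContext(conf)
// with spark-submit:  spark-submit --driver-memory 4g --class MyApp my-app.jar
// from an IDE:        add -Xmx4g to the run configuration's JVM options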
Thanks,
David
On
Hi
I wanted to store DataFrames as partitioned Hive tables. Is there a way to
do this via the saveAsTable call? The set of options does not seem to be
documented.
def saveAsTable(tableName: String, source: String, mode: SaveMode,
  options: Map[String, String]): Unit
(Scala-specific) Creates a
Hi Mohit,
please make sure you use the Reply to all button and include the mailing
list, otherwise only I will get your message ;)
Regarding your question:
Yes, that's also my understanding. You can partition streaming RDDs only by
time intervals, not by size. So depending on your incoming rate,
I would like to retrieve a column value from a Spark SQL query result. But
currently it seems that Spark SQL only supports retrieving by index:
val results = sqlContext.sql("SELECT name FROM people")
results.map(t => "Name: " + t(0)).collect().foreach(println)
I think it will be much more convenient if
If you use the latest version, Spark 1.3, you can use the DataFrame API like:
val results = sqlContext.sql("SELECT name FROM people")
results.select("name").show()
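And if you need the value itself rather than a printed table, a small sketch along the same lines (assuming name is a string column):

val names = results.select("name").map(_.getString(0)).collect()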
2015-03-22 15:40 GMT+08:00 amghost zhengweita...@gmail.com:
I would like to retrieve column value from Spark SQL query result. But
Hi Reza,
Yes, I just found RDD.cartesian(). Very useful.
Thanks,
David
On Sun, Mar 22, 2015 at 5:08 PM Reza Zadeh r...@databricks.com wrote:
You can do this with the 'cartesian' product method on RDD. For example:
val rdd1 = ...
val rdd2 = ...
val combinations =
On Sun, Mar 22, 2015 at 8:43 AM, deenar.toraskar deenar.toras...@db.com wrote:
1) If there are no sliding window calls in this streaming context, will
there be just one file written per interval?
As many files as there are partitions will be written in each interval.
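If a single file per interval is what you're after, one workaround (a sketch, not from this thread; the DStream and output path are made up) is to coalesce to one partition before saving, at the cost of write parallelism:

dstream.foreachRDD { (rdd, time) =>
  // one partition means one part-00000 file per interval directory
  rdd.coalesce(1).saveAsTextFile(s"hdfs:///out/events-${time.milliseconds}")
}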
2) if there is a sliding
My bad. This was an OutOfMemoryError disguised as something else.
Regards
Sab
On Sun, Mar 22, 2015 at 1:53 AM, Sabarish Sasidharan
sabarish.sasidha...@manthan.com wrote:
I am consistently running into this ArrayIndexOutOfBoundsException issue
when using trainImplicit. I have tried changing the
OK, I know that I can use the JDBC connector to create a DataFrame with
this command:
val jdbcDF = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:mysql://localhost:3306/video_rcmd?user=root&password=123456",
  "dbtable" -> "video"))
But I got this error:
java.sql.SQLException: No suitable driver found for ...
Hi All - I am trying to use the new SQLContext API for populating a DataFrame
from a JDBC data source, like this:
val jdbcDF = sqlContext.jdbc(
  url = "jdbc:postgresql://localhost:5430/dbname?user=user&password=111",
  table = "se_staging.exp_table3",
  columnName = "cs_id",
  lowerBound = 1,
  upperBound = 1,
I assume spark.default.parallelism is 4 in the VM Ashish was using.
Cheers
Hi,
I am not able to understand how the aggregate function works. Can someone
please explain how the result below came about?
I am running Spark using the Cloudera VM.
The result below is 17, but I am not able to find out how it is
calculating 17:
val data = sc.parallelize(List(2,3,4))
data.aggregate(0)((x,y) => x+y, (x,y) => 2+x+y)
Any particular reason you're not just downloading a build from
http://spark.apache.org/downloads.html instead? Even if you aren't using
Hadoop, any of those builds will work.
If you want to build from source, the Maven build is more reliable.
dean
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd
Greetings! [My apologies for this repost; I'm not certain that the first
message made it to the list.]
I sorted a dataset in Spark and then wrote it out in avro/parquet.
Then I wanted to check that it was sorted.
It looks like each partition has been sorted, but when reading in, the first
aDstream.transform(_.distinct()) will only make the elements of each RDD
in the DStream distinct, not for the whole DStream globally. Is that what
you're seeing?
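If you want a view that is distinct across the whole stream, one option (a sketch under assumptions, not code from this thread: aDstream is a DStream[String] and ssc is its StreamingContext) is to keep the seen values as state; each interval then emits every distinct value observed so far:

ssc.checkpoint("hdfs:///tmp/checkpoints")  // required by updateStateByKey

val distinctSoFar = aDstream
  .map(v => (v, ()))
  .updateStateByKey[Unit]((newValues: Seq[Unit], state: Option[Unit]) => Some(()))
  .map(_._1)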
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe
How about pointing LD_LIBRARY_PATH to the native lib folder?
You need Spark 1.2.0 or higher for the above to work. See SPARK-1719
Cheers
On Sun, Mar 22, 2015 at 4:02 AM, Xi Shen davidshe...@gmail.com wrote:
Hi Ted,
I have tried to invoke the command from both cygwin environment and
powershell
2 is added every time the final partition aggregator is called. The result
of summing the elements across partitions is 9 of course. If you force a
single partition (using spark-shell in local mode):
scala> val data = sc.parallelize(List(2,3,4), 1)
scala> data.aggregate(0)((x,y) => x+y, (x,y) => 2+x+y)
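A worked reconstruction of the 17 (assuming the VM's default of 4 partitions, as Ted noted): the per-partition sums under (x,y) => x+y are 0, 2, 3 and 4 (one partition is empty), and the combine function (x,y) => 2+x+y then runs once per partition: 2+0+0 = 2, then 2+2+2 = 6, then 2+6+3 = 11, then 2+11+4 = 17. With the single partition forced above, the result would instead be 2+0+9 = 11.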
posting my question again :)
Thanks for the pointer. Looking at the description below from the site, it
looks like in Spark the block size is not fixed; it's determined by the block
interval, and in fact for the same batch you could have different block
sizes. Did I get it right?
-
Another
From javadoc of JDBCRelation#columnPartition():
* Given a partitioning schematic (a column of integral type, a number of
* partitions, and upper and lower bounds on the column's value), generate
In your example, 1 and 1 are for the value of cs_id column.
Looks like all the values in
Please open a JIRA. We added the info to Row that will allow this to
happen, but we need to provide the methods you are asking for. I'll add
that this does work today in Python (i.e. row.columnName).
On Sun, Mar 22, 2015 at 12:40 AM, amghost zhengweita...@gmail.com wrote:
I would like to
...I even tried setting upper/lower bounds to the same value like 1 or 10
with the same result.
cs_id is a column with a cardinality of ~5*10^6.
So this is not the case here.
Regards,
Marek
2015-03-22 20:30 GMT+01:00 Ted Yu yuzhih...@gmail.com:
From javadoc of JDBCRelation#columnPartition():
*
I went over JDBCRelation#columnPartition() but didn't find obvious clue
(you can add more logging to confirm that the partitions were generated
correctly).
Looks like the issue may be somewhere else.
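For what it's worth, a simplified reconstruction (not the exact Spark source) of what columnPartition does with those arguments: the bounds only decide how the range is split into per-partition WHERE clauses; they do not filter rows, and values outside the bounds still land in the first or last partition.

def wherePredicates(column: String, lowerBound: Long, upperBound: Long,
    numPartitions: Int): Seq[String] = {
  val stride = (upperBound - lowerBound) / numPartitions
  (0 until numPartitions).map { i =>
    // first partition has no lower predicate, last has no upper predicate
    val lo = if (i == 0) "" else s"$column >= ${lowerBound + i * stride}"
    val hi = if (i == numPartitions - 1) "" else s"$column < ${lowerBound + (i + 1) * stride}"
    Seq(lo, hi).filter(_.nonEmpty).mkString(" AND ")
  }
}
// e.g. wherePredicates("cs_id", 1, 5000000, 4) yields one clause unbounded below,
// two bounded clauses, and one clause unbounded above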
Cheers
On Sun, Mar 22, 2015 at 12:47 PM, Marek Wiewiorka marek.wiewio...@gmail.com
wrote:
Did you build Spark with: -Pnetlib-lgpl?
Ref: https://spark.apache.org/docs/latest/mllib-guide.html
Burak
On Sun, Mar 22, 2015 at 7:37 AM, Ted Yu yuzhih...@gmail.com wrote:
How about pointing LD_LIBRARY_PATH to the native lib folder?
You need Spark 1.2.0 or higher for the above to work. See
Note you can use HiveQL syntax for creating dynamically partitioned tables
though.
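A sketch of that workaround (table, column and DataFrame names are made up, and it assumes a HiveContext): register the DataFrame as a temp table and insert it into a partitioned Hive table with dynamic partitioning enabled.

hiveContext.sql("SET hive.exec.dynamic.partition=true")
hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

df.registerTempTable("events_staging")
hiveContext.sql(
  """CREATE TABLE IF NOT EXISTS events (id BIGINT, payload STRING)
    |PARTITIONED BY (dt STRING)
    |STORED AS PARQUET""".stripMargin)
// the partition column must come last in the SELECT
hiveContext.sql(
  "INSERT OVERWRITE TABLE events PARTITION (dt) SELECT id, payload, dt FROM events_staging")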
On Sun, Mar 22, 2015 at 1:29 PM, Michael Armbrust mich...@databricks.com
wrote:
Not yet. This is on the roadmap for Spark 1.4.
On Sun, Mar 22, 2015 at 12:19 AM, deenar.toraskar deenar.toras...@db.com
wrote:
Not yet. This is on the roadmap for Spark 1.4.
On Sun, Mar 22, 2015 at 12:19 AM, deenar.toraskar deenar.toras...@db.com
wrote:
Hi
I wanted to store DataFrames as partitioned Hive tables. Is there a way to
do this via the saveAsTable call? The set of options does not seem to be
documented.
You can include * and a column alias in the same select clause:
var df1 = sqlContext.sql("select *, column_id AS table1_id from table1")
I'm also hoping to resolve SPARK-6376
https://issues.apache.org/jira/browse/SPARK-6376 before Spark 1.3.1 which
will let you do something like:
var df1 =
Something like this?
(2 to alphabetLength).toList
  .map(shift => Object.myFunction(inputRDD, shift).map(v => shift -> v))
  .foldLeft(Object.myFunction(inputRDD, 1).map(v => 1 -> v))(_ union _)
which is an RDD[(Int, Char)]
Problem is that you can't play with RDDs inside of RDDs. The recursive
structure
I'm following some online tutorial written in Python and trying to convert
a Spark SQL table object to an RDD in Scala.
The Spark SQL just loads a simple table from a CSV file. The tutorial says
to convert the table to an RDD.
The Python is:
products_rdd = sqlContext.table("products").map(lambda
You can do this with the 'cartesian' product method on RDD. For example:
val rdd1 = ...
val rdd2 = ...
val combinations = rdd1.cartesian(rdd2).filter { case (a, b) => a < b }
Reza
On Sat, Mar 21, 2015 at 10:37 PM, Xi Shen davidshe...@gmail.com wrote:
Hi,
I have two big RDD, and I need to do
What do you mean by not distinct?
It does work for me:
[image: Inline image 1]
Code:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkContext, SparkConf}
val ssc = new StreamingContext(sc, Seconds(1))
val data =
You need either
.map { row =>
  (row(0).asInstanceOf[Float], row(1).asInstanceOf[Float], ...)
}
or
.map { case Row(f0: Float, f1: Float, ...) =>
  (f0, f1)
}
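Filling in the ellipses, a complete toy version of the second form (column names and types are made up) that mirrors the tutorial's products_rdd:

import org.apache.spark.sql.Row

// DataFrame -> RDD of plain Scala tuples, assuming two Float columns
val productsRdd = sqlContext.table("products").map {
  case Row(price: Float, quantity: Float) => (price, quantity)
}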
On 3/23/15 9:08 AM, Minnow Noir wrote:
I'm following some online tutorial written in Python and trying to
convert a Spark SQL table
I thought of formation #1.
But it looks like when there are many fields, formation #2 is cleaner.
Cheers
On Sun, Mar 22, 2015 at 8:14 PM, Cheng Lian lian.cs@gmail.com wrote:
You need either
.map { row =>
  (row(0).asInstanceOf[Float], row(1).asInstanceOf[Float], ...)
}
or
.map { case
Hi, Spark users.
When running a Spark application with lots of executors (300+), I see the
following failures:
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:152)
    at
How are you running your Spark instance, out of curiosity? Via YARN or
standalone mode? When connecting the Spark thrift server to the Spark service,
have you allocated enough memory and CPU when executing with Spark?
On Sun, Mar 22, 2015 at 3:39 AM fanooos dev.fano...@gmail.com wrote:
We have
On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers ko...@tresata.com wrote:
so finally I can resort to:
rdd.saveAsObjectFile(...)
sc.objectFile(...)
but that seems like a rather broken abstraction.
This seems like a fine solution to me.
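For the record, a sketch of the step-by-step pattern being discussed (paths, the record type and the transforms are made up): every stage writes to HDFS and the next stage reads that output back, so nothing relies on lineage or caching and each stage is evaluated exactly once.

// hypothetical record type and parser, just to make the sketch self-contained
case class Record(id: Long, valid: Boolean)
def parse(line: String): Record = {
  val p = line.split(",")
  Record(p(0).toLong, p(1).toBoolean)
}

// stage 1: compute and persist to reliable storage
sc.textFile("hdfs:///input/data").map(parse).saveAsObjectFile("hdfs:///pipeline/stage1")

// stage 2: reload stage 1's output instead of reusing its in-memory lineage
sc.objectFile[Record]("hdfs:///pipeline/stage1").filter(_.valid)
  .saveAsObjectFile("hdfs:///pipeline/stage2")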