Hello All,
How can I save an image file (JPG format) to my local system? I used
binaryFiles to load the pictures into Spark, converted them into arrays, and
processed them. Below is the code:
images = sc.binaryFiles("path/car*")
imagerdd = images.map(lambda (x,y):
Try looking at the stdout logs. I ran exactly the same job as you and did not
see anything on the console either, but found it in stdout.
[csingh@<> ~]$ spark-submit --class org.apache.spark.examples.SparkPi
--master yarn --deploy-mode cluster --name RT_SparkPi
I've never done it that way, but you can simply use the withColumn method on
DataFrames to do it.
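A minimal sketch, spark-shell style (the column name "ms" and the constant value are assumptions):

import sqlContext.implicits._
import org.apache.spark.sql.functions.lit

val df = sc.parallelize(Seq(("a", 1), ("b", 2))).toDF("key", "value")

// withColumn plus lit() adds a column holding the same constant for every row.
val withConstant = df.withColumn("ms", lit(1455324000000L))
withConstant.show()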
On 13 Feb 2016 2:19 a.m., "Andy Davidson"
wrote:
> I am trying to add a column with a constant value to my data frame. Any
> idea what I am doing wrong?
>
> Kind
Maybe a comment should be added to SparkPi.scala, telling the user to look for
the value in the stdout log?
Cheers
On Sat, Feb 13, 2016 at 3:12 AM, Chandeep Singh
wrote:
> Try looking at the stdout logs. I ran exactly the same job as you and did not
> see anything on the console
Hi,
I start my spark-shell with --driver-class-path /home/hduser/jars/ojdbc6.jar.
It finds the driver, as any mapped read returns the correct structure for the
Oracle tables. Even when I join columns I can see the join structure:
scala> empDepartments.printSchema()
root
|-- DEPARTMENT_ID:
Hello,
When I want to compile the Spark project, the following error occurs:
milad@pc:~/workspace/source/spark$ build/mvn -DskipTests clean package
Using `mvn` from path: /home/milad/.linuxbrew/bin/mvn
Unrecognized VM option 'MaxPermSize=512M'
Error: Could not create the Java Virtual Machine.
I have the following for my shell:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M
-XX:ReservedCodeCacheSize=512m"
How do you specify MAVEN_OPTS?
Which versions of Java / Maven do you use?
Cheers
On Sat, Feb 13, 2016 at 7:34 AM, Milad khajavi wrote:
> Hello,
> When I want
Like many things, it is not that straightforward!
You need to explicitly reference the Oracle JAR file with the --jars switch:
spark-shell --master yarn --deploy-mode client --driver-class-path
/home/hduser/jars/ojdbc6.jar --jars /home/hduser/jars/ojdbc6.jar
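With the driver on the classpath, a minimal JDBC read might look like the sketch below (the connection URL, credentials, and table name are placeholders, not the real ones):

val empDepartments = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/orcl")   // hypothetical host/service
  .option("dbtable", "HR.DEPARTMENTS")                     // hypothetical table
  .option("user", "scott")
  .option("password", "tiger")
  .option("driver", "oracle.jdbc.OracleDriver")
  .load()

empDepartments.printSchema()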
HTH
Dr Mich Talebzadeh
LinkedIn
Are you using JDK8?
> On 13 Feb 2016, at 16:34, Milad khajavi wrote:
>
> Hello,
> When I want to compile the Spark project, the following error occurs:
>
> milad@pc:~/workspace/source/spark$ build/mvn -DskipTests clean package
> Using `mvn` from path:
Hi,
I have several instances of the same web service that run some ML
algorithms on Spark (both training and prediction) and also do some
Spark-unrelated work. Each web-service instance creates its own
JavaSparkContext, so they are seen as separate applications by Spark and are thus configured
This is possible with YARN. You also need to think about preemption, in case one
web service starts doing something and, after a while, another web service also
wants to do something.
> On 13 Feb 2016, at 17:40, Eugene Morozov wrote:
>
> Hi,
>
> I have several
Hi,
Thanks for the question.
1) The core-site.xml holds the parameter for the defaultFS:
fs.defaultFS
hdfs://:8020
This will be appended to your value in spark.eventLog.dir. So depending on
which location you intend to write it to, you can point it to either HDFS or
local.
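As a hedged illustration (the HDFS path, host, and app name are assumptions), the same choice can be made explicit when building the SparkConf:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example: write event logs to an explicit HDFS location rather
// than relying on the default filesystem to resolve a relative path.
val conf = new SparkConf()
  .setAppName("event-log-example")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs://namenode:8020/user/spark/applicationHistory")

val sc = new SparkContext(conf)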
As far
Hi,
Unfortunately Oracle table columns defined as NUMBER result in overflow.
An alternative seems to be to create a UDF to map that column to Double:
val toDouble = udf((d: java.math.BigDecimal) => d.toString.toDouble)
This is the DF I have defined to fetch one column as below
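A minimal sketch of applying such a UDF (the DataFrame and column names here are assumptions, not the original definition):

import org.apache.spark.sql.functions.{col, udf}

// `sales` is assumed to be a DataFrame loaded earlier (for example via a JDBC
// read) whose AMOUNT_SOLD column arrives as java.math.BigDecimal (Oracle NUMBER).
val toDouble = udf((d: java.math.BigDecimal) => d.toString.toDouble)
val salesAsDouble = sales.withColumn("AMOUNT_SOLD_D", toDouble(col("AMOUNT_SOLD")))
salesAsDouble.printSchema()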
Please take a look
at sql/core/src/main/scala/org/apache/spark/sql/functions.scala :
def udf(f: AnyRef, dataType: DataType): UserDefinedFunction = {
  UserDefinedFunction(f, dataType, None)
}
And sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala :
test("udf") {
val foo =
Hi,
I was interested in knowing how to load packages into a Spark cluster
started locally. Can someone pass me the links for setting the conf file so
that the packages can be loaded?
Regards,
Gourav
On Fri, Feb 12, 2016 at 6:52 PM, Burak Yavuz wrote:
> Hello Gourav,
>
> The
Does mapWithState checkpoint the data?
When my application goes down and is restarted from a checkpoint, will
mapWithState need to recompute the previous batches' data?
Also, to use mapWithState I will need to upgrade my application, as I am
using version 1.4.0 and mapWithState isn't supported
The JavaPairRDD class creates instances of scala.math.Ordering.
I've been able to view the code with Fernflower decompiler in IntelliJ IDEA.
For K=Integer, one can do:
"Ordering ordering =
Ordering$.MODULE$.comparatorToOrdering(Comparator.naturalOrder())"
or implement Comparator directly.
But
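For reference, the equivalent conversion written in Scala (a minimal sketch; the key type Integer is just an example):

import java.util.Comparator
import scala.math.Ordering

// Build a scala.math.Ordering from a Java Comparator, the same conversion the
// Java code reaches through Ordering$.MODULE$.
val intOrdering: Ordering[Integer] =
  Ordering.comparatorToOrdering(Comparator.naturalOrder[Integer]())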
mapWithState supports checkpoint.
There have been some bug fixes since the release of 1.6.0,
e.g.
SPARK-12591 NullPointerException using checkpointed mapWithState with
KryoSerializer
which is in the upcoming 1.6.1
Cheers
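A hedged sketch of the mapWithState pattern (key/value types, the batch interval, the checkpoint path, and the `events` DStream are assumptions):

import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

// Checkpointing must be enabled for mapWithState; state is written under this
// directory so a restarted application can recover it.
val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint("hdfs:///user/spark/checkpoints/my-app")

// Running sum per key; `events` is assumed to be a DStream[(String, Int)].
val mappingFunc = (key: String, value: Option[Int], state: State[Int]) => {
  val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (key, sum)
}

val stateStream = events.mapWithState(StateSpec.function(mappingFunc))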
On Sat, Feb 13, 2016 at 12:05 PM, Abhishek Anand
If you don't want to upgrade, then your only option will be updateStateByKey.
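A minimal sketch of the updateStateByKey equivalent (types and the `events` DStream are assumptions); it is available in 1.4.0 and also requires a checkpoint directory to be set on the StreamingContext:

// Same running sum per key, written with updateStateByKey.
val updateFunc = (newValues: Seq[Int], runningCount: Option[Int]) => {
  Option(newValues.sum + runningCount.getOrElse(0))
}

val counts = events.updateStateByKey(updateFunc)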
On 13 Feb 2016 8:48 p.m., "Ted Yu" wrote:
> mapWithState supports checkpoint.
>
> There has been some bug fix since release of 1.6.0
> e.g.
> SPARK-12591 NullPointerException using checkpointed
Not sure if it’s related, but in our Hadoop configuration we’re also setting
sc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
Cheers,
-patrick
From: Andy Davidson
Date: Friday, 12 February 2016 at 17:34
To: Igor
I have a Dataset[(K, V)].
I would like to group by K and then reduce V using a function (V, V) => V.
How do I do this?
I would expect something like:
val ds = Dataset[(K, V)]
ds.groupBy(_._1).mapValues(_._2).reduce(f)
or better:
ds.grouped.reduce(f) # grouped only works on Dataset[(_, _)] and I
On Fri, Feb 12, 2016 at 11:10 PM, Koert Kuipers wrote:
> in spark, every partition needs to fit in the memory available to the core
> processing it.
>
That does not agree with my understanding of how it works. I think you
could do
Hi,
We have a requirement wherein we need to store documents in HDFS. The
documents are nothing but JSON strings. We should be able to query them by
id using Spark SQL / HiveContext as and when needed. What would be the
correct approach to do this?
Thanks!
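One hedged sketch of a possible approach (the HDFS path, table name, and id value are assumptions):

// Read the JSON strings from HDFS; each line is assumed to be one JSON
// document carrying an "id" field. The schema is inferred by the reader.
val docs = sqlContext.read.json("hdfs:///user/app/documents")
docs.registerTempTable("docs")

// Query a single document by id through Spark SQL.
sqlContext.sql("SELECT * FROM docs WHERE id = '12345'").show()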
Hi,
How can I query a Hive table from inside mapPartitions to retrieve a value
by id?
Thanks,
Swetha
Sorry, I meant to say:
my way to deal with OOMs is almost always simply to increase the number of
partitions. Maybe there is a better way that I am not aware of.
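A minimal illustration in spark-shell (the data size and partition counts are made up):

// Increasing the number of partitions makes each partition smaller, so the
// per-task memory needed during wide operations goes down.
val rdd = sc.parallelize(1 to 1000000, 100)
val repartitioned = rdd.repartition(2000)
println(repartitioned.partitions.length)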
On Sat, Feb 13, 2016 at 11:38 PM, Koert Kuipers wrote:
> That's right, it's the reduce operation that makes the
Instead of grouping with a lambda function, you can do it with a column
expression to avoid materializing an unnecessary tuple:
df.groupBy($"_1")
Regarding the mapValues, you can do something similar using an Aggregator
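For instance, a hedged sketch of the column-expression route (the data and the built-in sum aggregate are made up; for an arbitrary (V, V) => V reduce, an Aggregator would take the place of sum):

import sqlContext.implicits._
import org.apache.spark.sql.functions.sum

val df = Seq(("a", 1), ("a", 2), ("b", 5)).toDF()

// Group on the first tuple field by column expression; no intermediate key
// tuple is materialized.
val reduced = df.groupBy($"_1").agg(sum($"_2"))
reduced.show()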
What do you mean by "does not work"? What's the error message? BTW, would it
be simpler to register the 3 DataFrames as temporary tables and then use
the SQL query you used before in Hive and Oracle?
On Sun, Feb 14, 2016 at 9:28 AM, Mich Talebzadeh
wrote:
> Hi,
>
>
>
> I
selectExpr just uses the SQL parser to interpret the string you give it.
So to get a string literal you would use quotes:
df.selectExpr("*", "'" + time.milliseconds() + "' AS ms")
On Fri, Feb 12, 2016 at 6:19 PM, Andy Davidson <
a...@santacruzintegration.com> wrote:
> I am trying to add a column
That's right, it's the reduce operation that makes the in-memory assumption,
not the map (although I am still suspicious that the map actually streams
from disk to disk record by record).
In reality, though, my experience is that if Spark cannot fit partitions in
memory it doesn't work well. I get
Hi All,
Any ideas on this one?
The size of this directory keeps on growing.
I can see there are many files from a day earlier too.
Cheers !!
Abhi
On Tue, Jan 26, 2016 at 7:13 PM, Abhishek Anand
wrote:
> Hi Adrian,
>
> I am running spark in standalone mode.
>
> The
Hi,
I have created DFs on three Oracle tables.
The join in Hive and Oracle is pretty simple:
SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
FROM sales s, times t, channels c
WHERE s.time_id = t.time_id
AND s.channel_id = c.channel_id
GROUP BY
Thanks, I will look into Aggregator as well.
On Sun, Feb 14, 2016 at 12:31 AM, Michael Armbrust
wrote:
> Instead of grouping with a lambda function, you can do it with a column
> expression to avoid materializing an unnecessary tuple:
>
> df.groupBy($"_1")
>
> Regarding