Saving an Image file using binary Files - pyspark

2016-02-13 Thread Sainath Palla
Hello All, how can I save an image file (JPG format) to my local system? I used binaryFiles to load the pictures into Spark, converted them into arrays and processed them. Below is the code: images = sc.binaryFiles("path/car*") imagerdd = images.map(lambda (x,y):

Re: new to Spark - trying to get a basic example to run - could use some help

2016-02-13 Thread Chandeep Singh
Try looking at the stdout logs. I ran exactly the same job as you and also did not see anything on the console, but found it in stdout. [csingh@<> ~]$ spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --name RT_SparkPi
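(A hedged aside, not from the thread: with --deploy-mode cluster the driver's stdout lives in the YARN container logs rather than on the submitting console, so once the application has finished it can usually be pulled back with something like yarn logs -applicationId <application id>, or viewed through the ResourceManager web UI.)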

Re: org.apache.spark.sql.AnalysisException: undefined function lit;

2016-02-13 Thread Sebastian Piu
I've never done it that way but you can simply use the withColumn method in data frames to do it. On 13 Feb 2016 2:19 a.m., "Andy Davidson" wrote: > I am trying to add a column with a constant value to my data frame. Any > idea what I am doing wrong? > > Kind
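(A minimal sketch, not from the thread, of the withColumn approach Sebastian suggests, assuming a DataFrame called df and a constant to be added as a new column:

    import org.apache.spark.sql.functions.lit
    // add a column named "ms" holding the same constant value in every row
    val withMs = df.withColumn("ms", lit(System.currentTimeMillis))

lit wraps the literal in a Column, which is what withColumn expects.)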

Re: new to Spark - trying to get a basic example to run - could use some help

2016-02-13 Thread Ted Yu
Maybe a comment should be added to SparkPi.scala, telling the user to look for the value in the stdout log? Cheers On Sat, Feb 13, 2016 at 3:12 AM, Chandeep Singh wrote: > Try looking at stdout logs. I ran the exactly same job as you and did not > see anything on the console

jdbc driver used by spark fails following first stage

2016-02-13 Thread Mich Talebzadeh
Hi, I start my spark-shell with --driver-class-path /home/hduser/jars/ojdbc6.jar. It finds the driver, as any map read returns the correct structure for the Oracle tables. Even when I join columns I can see the join structure: scala> empDepartments.printSchema() root |-- DEPARTMENT_ID:

Unrecognized VM option 'MaxPermSize=512M'

2016-02-13 Thread Milad khajavi
Hello, When I want to compile the Spark project, the following error occurs: milad@pc:~/workspace/source/spark$ build/mvn -DskipTests clean package Using `mvn` from path: /home/milad/.linuxbrew/bin/mvn Unrecognized VM option 'MaxPermSize=512M' Error: Could not create the Java Virtual Machine.

Re: Unrecognized VM option 'MaxPermSize=512M'

2016-02-13 Thread Ted Yu
I have the following for my shell: export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" How do you specify MAVEN_OPTS? Which version of Java / Maven do you use? Cheers On Sat, Feb 13, 2016 at 7:34 AM, Milad khajavi wrote: > Hello, > When I want

RE: jdbc driver used by spark fails following first stage, solved it

2016-02-13 Thread Mich Talebzadeh
Like many things it is not that straightforward! You need to explicitly reference the Oracle jar file with the --jars switch: spark-shell --master yarn --deploy-mode client --driver-class-path /home/hduser/jars/ojdbc6.jar --jars /home/hduser/jars/ojdbc6.jar HTH Dr Mich Talebzadeh LinkedIn

Re: Unrecognized VM option 'MaxPermSize=512M'

2016-02-13 Thread Jörn Franke
Are you using JDK8? > On 13 Feb 2016, at 16:34, Milad khajavi wrote: > > Hello, > When I want to compile the Spark project, the following error occurs: > > milad@pc:~/workspace/source/spark$ build/mvn -DskipTests clean package > Using `mvn` from path:

Best practices for sharing a Spark cluster across a few applications

2016-02-13 Thread Eugene Morozov
Hi, I have several instances of the same web service that run some ML algos on Spark (both training and prediction) and do some Spark-unrelated work. Each web-service instance creates its own JavaSparkContext, thus they're seen as separate applications by Spark, thus they're configured

Re: Best practices for sharing a Spark cluster across a few applications

2016-02-13 Thread Jörn Franke
This is possible with YARN. You also need to think about preemption, in case one web service starts doing something and after a while another web service also wants to do something. > On 13 Feb 2016, at 17:40, Eugene Morozov wrote: > > Hi, > > I have several

Re: Write spark eventLog to both HDFS and local FileSystem

2016-02-13 Thread nsalian
Hi, Thanks for the question. 1) The core-site.xml holds the parameter for the defaultFS (fs.defaultFS = hdfs://:8020). This will be appended to your value in spark.eventLog.dir. So depending on which location you intend to write it to, you can point it to either HDFS or local. As far
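(A hedged sketch, not from the thread, of the two relevant spark-defaults.conf entries; the path shown is illustrative only:

    spark.eventLog.enabled  true
    spark.eventLog.dir      hdfs://namenode:8020/user/spark/applicationHistory

Giving spark.eventLog.dir an explicit scheme such as hdfs:// or file:// avoids relying on fs.defaultFS altogether.)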

using udf to convert Oracle number column in Data Frame

2016-02-13 Thread Mich Talebzadeh
Hi, Unfortunately Oracle table columns defined as NUMBER result in overflow. An alternative seems to be to create a UDF to map that column to Double val toDouble = udf((d: java.math.BigDecimal) => d.toString.toDouble) This is the DF I have defined to fetch one column as below
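(A rough sketch, not from the thread, of applying such a UDF, assuming a DataFrame df with a NUMBER column whose name, AMOUNT_SOLD, is only an example:

    import org.apache.spark.sql.functions.udf
    val toDouble = udf((d: java.math.BigDecimal) => d.toString.toDouble)
    // replace the BigDecimal column with a Double column of the same name
    val converted = df.withColumn("AMOUNT_SOLD", toDouble(df("AMOUNT_SOLD"))))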

Re: using udf to convert Oracle number column in Data Frame

2016-02-13 Thread Ted Yu
Please take a look at sql/core/src/main/scala/org/apache/spark/sql/functions.scala : def udf(f: AnyRef, dataType: DataType): UserDefinedFunction = { UserDefinedFunction(f, dataType, None) And sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala : test("udf") { val foo =

Re: Using SPARK packages in Spark Cluster

2016-02-13 Thread Gourav Sengupta
Hi, I was interested in knowing how to load packages into a Spark cluster started locally. Can someone pass me the links on how to set the conf file so that the packages can be loaded? Regards, Gourav On Fri, Feb 12, 2016 at 6:52 PM, Burak Yavuz wrote: > Hello Gourav, > > The
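(A hedged sketch, not from the thread: packages can be pulled in either on the command line, e.g.

    spark-shell --packages com.databricks:spark-csv_2.10:1.3.0

or through spark-defaults.conf with the equivalent property, e.g.

    spark.jars.packages  com.databricks:spark-csv_2.10:1.3.0

The coordinates above are only an example.)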

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-13 Thread Abhishek Anand
Does mapWithState checkpoint the data? When my application goes down and is restarted from a checkpoint, will mapWithState need to recompute the previous batches' data? Also, to use mapWithState I will need to upgrade my application, as I am using version 1.4.0 and mapWithState isn't supported

Re: How to use scala.math.Ordering in java

2016-02-13 Thread shcher
The JavaPairRDD class creates instances of scala.math.Ordering. I've been able to view the code with the Fernflower decompiler in IntelliJ IDEA. For K=Integer, one can do: "Ordering ordering = Ordering$.MODULE$.comparatorToOrdering(Comparator.naturalOrder())" or implement Comparator directly. But

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-13 Thread Ted Yu
mapWithState supports checkpointing. There have been some bug fixes since the release of 1.6.0, e.g. SPARK-12591 (NullPointerException using checkpointed mapWithState with KryoSerializer), which is in the upcoming 1.6.1. Cheers On Sat, Feb 13, 2016 at 12:05 PM, Abhishek Anand

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-13 Thread Sebastian Piu
If you don't want to upgrade, your only option will be updateStateByKey then. On 13 Feb 2016 8:48 p.m., "Ted Yu" wrote: > mapWithState supports checkpoint. > > There has been some bug fix since release of 1.6.0 > e.g. > SPARK-12591 NullPointerException using checkpointed
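(A minimal Scala sketch, not from the thread, and the thread itself uses the Java API, of updateStateByKey over an assumed DStream[(String, Int)] called counts, keeping a running total per key:

    val runningTotals = counts.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
      // add this batch's values to whatever total we had before
      Some(newValues.sum + state.getOrElse(0))
    }

Like mapWithState, updateStateByKey needs checkpointing enabled on the StreamingContext.)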

Re: newbie unable to write to S3 403 forbidden error

2016-02-13 Thread Patrick Plaatje
Not sure if it’s related, but in our Hadoop configuration we’re also setting sc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem"); Cheers, -patrick From: Andy Davidson Date: Friday, 12 February 2016 at 17:34 To: Igor

GroupedDataset needs a mapValues

2016-02-13 Thread Koert Kuipers
i have a Dataset[(K, V)]. i would like to group by K and then reduce V using a function (V, V) => V. how do i do this? i would expect something like: val ds = Dataset[(K, V)] ds.groupBy(_._1).mapValues(_._2).reduce(f) or better: ds.grouped.reduce(f) # grouped only works on Dataset[(_, _)] and i

Re: coalesce and executor memory

2016-02-13 Thread Daniel Darabos
On Fri, Feb 12, 2016 at 11:10 PM, Koert Kuipers wrote: > in spark, every partition needs to fit in the memory available to the core > processing it. > That does not agree with my understanding of how it works. I think you could do

How to store documents in hdfs and query them by id using Hive/Spark SQL

2016-02-13 Thread SRK
Hi, We have a requirement wherein we need to store documents in HDFS. The documents are nothing but JSON strings. We should be able to query them by Id using Spark SQL/HiveContext as and when needed. What would be the correct approach to do this? Thanks! -- View this message in context:

How to query a Hive table by Id from inside map partitions

2016-02-13 Thread SRK
Hi, How can I query a Hive table from inside mapPartitions to retrieve a value by Id? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-query-a-Hive-table-by-Id-from-inside-map-partitions-tp26220.html Sent from the Apache Spark User

Re: coalesce and executor memory

2016-02-13 Thread Koert Kuipers
sorry i meant to say: and my way to deal with OOMs is almost always simply to increase the number of partitions. maybe there is a better way that i am not aware of. On Sat, Feb 13, 2016 at 11:38 PM, Koert Kuipers wrote: > thats right, its the reduce operation that makes the

Re: GroupedDataset needs a mapValues

2016-02-13 Thread Michael Armbrust
Instead of grouping with a lambda function, you can do it with a column expression to avoid materializing an unnecessary tuple: df.groupBy($"_1") Regarding the mapValues, you can do something similar using an Aggregator
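(A rough sketch, not from the thread, of the Aggregator shape for reducing the values of a Dataset[(String, Int)] grouped by key. This matches the Spark 1.6 abstract methods; later versions also ask for buffer and output encoders, so treat the wiring as an outline only:

    import org.apache.spark.sql.expressions.Aggregator

    // sums the Int halves of (String, Int) pairs
    val sumValues = new Aggregator[(String, Int), Int, Int] {
      def zero: Int = 0
      def reduce(total: Int, pair: (String, Int)): Int = total + pair._2
      def merge(a: Int, b: Int): Int = a + b
      def finish(total: Int): Int = total
    }

It can then be turned into a column with toColumn and passed to agg on the grouped Dataset.)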

Re: Joining three tables with data frames

2016-02-13 Thread Jeff Zhang
What do you mean by "does not work"? What's the error message? BTW, would it be simpler to register the 3 data frames as temporary tables and then use the SQL query you used before in Hive and Oracle? On Sun, Feb 14, 2016 at 9:28 AM, Mich Talebzadeh wrote: > Hi, > > > > I
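(A brief sketch, not from the thread, of the temporary-table route Jeff suggests, assuming three DataFrames named sales, times and channels, a SQLContext/HiveContext called sqlContext, and a GROUP BY completed from the SELECT list:

    sales.registerTempTable("sales")
    times.registerTempTable("times")
    channels.registerTempTable("channels")
    val totalSales = sqlContext.sql(
      """SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
        |FROM sales s, times t, channels c
        |WHERE s.time_id = t.time_id AND s.channel_id = c.channel_id
        |GROUP BY t.calendar_month_desc, c.channel_desc""".stripMargin))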

Re: org.apache.spark.sql.AnalysisException: undefined function lit;

2016-02-13 Thread Michael Armbrust
selectExpr just uses the SQL parser to interpret the string you give it. So to get a string literal you would use quotes: df.selectExpr("*", "'" + time.milliseconds() + "' AS ms") On Fri, Feb 12, 2016 at 6:19 PM, Andy Davidson < a...@santacruzintegration.com> wrote: > I am trying to add a column

Re: coalesce and executor memory

2016-02-13 Thread Koert Kuipers
that's right, it's the reduce operation that makes the in-memory assumption, not the map (although i am still suspicious that the map actually streams from disk to disk record by record). in reality though my experience is that if spark cannot fit partitions in memory it doesn't work well. i get

Re: Worker's BlockManager Folder not getting cleared

2016-02-13 Thread Abhishek Anand
Hi All, Any ideas on this one ? The size of this directory keeps on growing. I can see there are many files from a day earlier too. Cheers !! Abhi On Tue, Jan 26, 2016 at 7:13 PM, Abhishek Anand wrote: > Hi Adrian, > > I am running spark in standalone mode. > > The

Joining three tables with data frames

2016-02-13 Thread Mich Talebzadeh
Hi, I have created DFs on three Oracle tables. The join in Hive and Oracle is pretty simple: SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales FROM sales s, times t, channels c WHERE s.time_id = t.time_id AND s.channel_id = c.channel_id GROUP BY
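(For comparison, a hedged sketch, not from the thread, of the same query written directly against the three DataFrames, assuming they are named sales, times and channels:

    import org.apache.spark.sql.functions.sum
    val totalSales = sales
      .join(times, sales("time_id") === times("time_id"))
      .join(channels, sales("channel_id") === channels("channel_id"))
      .groupBy(times("calendar_month_desc"), channels("channel_desc"))
      .agg(sum(sales("amount_sold")).as("TotalSales")))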

Re: GroupedDataset needs a mapValues

2016-02-13 Thread Koert Kuipers
thanks i will look into Aggregator as well On Sun, Feb 14, 2016 at 12:31 AM, Michael Armbrust wrote: > Instead of grouping with a lambda function, you can do it with a column > expression to avoid materializing an unnecessary tuple: > > df.groupBy($"_1") > > Regarding