Re: Databricks Cloud vs AWS EMR

2016-01-28 Thread Eran Witkon
Can you name the features that make Databricks better than Zeppelin? Eran On Fri, 29 Jan 2016 at 01:37 Michal Klos wrote: > We use both databricks and emr. We use databricks for our exploratory / > adhoc use cases because their notebook is pretty badass and better than >

Re: How to ignore case in dataframe groupby?

2015-12-30 Thread Eran Witkon
hange. > DF.withColumn("upper-code",upper(df("countrycode"))). > > This creates a new column "upper-code". Is there a way to update the > column or create a new df with update column? > > Thanks, > Raja > > On Thursday, 24 December 2015 6:17 P

Re: How can I get the column data based on specific column name and then stored these data in array or list ?

2015-12-25 Thread Eran Witkon
If you drop other columns (or map to a new df with only that column) and call collect i think you will get what you want. On Fri, 25 Dec 2015 at 10:26 fightf...@163.com wrote: > Emm...I think you can do a df.map and store each column value to your list. > >
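A minimal sketch of the "keep only that column, then collect" approach described above, assuming a DataFrame df with a string column "name" (both names are illustrative); collect() pulls everything to the driver, so it only suits results that fit in memory there:

  // Select the single column, extract its value per row, and materialize it locally.
  val names: Array[String] = df.select("name")
    .map(_.getString(0))
    .collect()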

Extract compressed JSON within JSON

2015-12-24 Thread Eran Witkon
Hi, I have a JSON file with the following row format: {"cty":"United Kingdom","gzip":"H4sIAKtWystVslJQcs4rLVHSUUouqQTxQvMyS1JTFLwz89JT8nOB4hnFqSBxj/zS4lSF/DQFl9S83MSibKBMZVExSMbQwNBM19DA2FSpFgDvJUGVUw==","nm":"Edmund lronside","yrs":"1016"} The gzip field is a compressed JSON by
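The "H4sIA..." value looks like a Base64-encoded gzip stream. A hedged sketch of one way to unpack it back into the embedded JSON string, assuming Java 8's java.util.Base64 and that the field really is Base64-encoded gzip (not confirmed in the thread):

  import java.io.ByteArrayInputStream
  import java.util.Base64
  import java.util.zip.GZIPInputStream
  import scala.io.Source
  import org.apache.spark.sql.functions.udf

  // Decode the Base64 text, then gunzip the bytes back into the embedded JSON string.
  val gunzip = udf { (b64: String) =>
    val in = new GZIPInputStream(new ByteArrayInputStream(Base64.getDecoder.decode(b64)))
    try Source.fromInputStream(in, "UTF-8").mkString finally in.close()
  }

  val df = sqlContext.read.json("sample.json")            // path is illustrative
  val unpacked = df.withColumn("gzip_json", gunzip(df("gzip")))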

Re: How to Parse & flatten JSON object in a text file using Spark into Dataframe

2015-12-24 Thread Eran Witkon
the JSON string representation of the whole line and you have a nested JSON schema which SparkSQL can read. Eran On Thu, Dec 24, 2015 at 10:26 AM Eran Witkon <eranwit...@gmail.com> wrote: > I don't have the exact answer for you but I would look for something using > explode method
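A hedged sketch of the explode() idea mentioned above, assuming ProjectDetails is read as an array of structs and using the field names from the original question (the real schema may differ):

  import org.apache.spark.sql.functions.explode

  val df = sqlContext.read.json("employees.json")         // illustrative path
  // One output row per element of the ProjectDetails array.
  val flat = df.select(df("employeeID"), df("Name"), explode(df("ProjectDetails")).as("project"))
  flat.select("employeeID", "Name", "project.ProjectName", "project.Role").show()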

Re: Extract compressed JSON within JSON

2015-12-24 Thread Eran Witkon
t;nm": \"$nm\" , "yrs": \"$yrs\"}"""}) See this link for source http://stackoverflow.com/questions/34069282/how-to-query-json-data-column-using-spark-dataframes Eran On Thu, Dec 24, 2015 at 11:42 AM Eran Witkon <eranwit...@gmail.

Re: How to ignore case in dataframe groupby?

2015-12-24 Thread Eran Witkon
Use DF.withColumn("upper-code", upper(df("countrycode"))) or just run a map function that does the same On Thu, Dec 24, 2015 at 2:05 PM Bharathi Raja wrote: > Hi, > Values in a dataframe column named countrycode are in different cases. Eg: > (US, us). groupBy & count
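A minimal sketch of the normalize-then-group approach, assuming a DataFrame df with a string column "countrycode" as in the question (upper comes from org.apache.spark.sql.functions):

  import org.apache.spark.sql.functions.upper

  // Normalize case into a new column, then group and count on it.
  val counts = df
    .withColumn("countrycode_upper", upper(df("countrycode")))
    .groupBy("countrycode_upper")
    .count()
  counts.show()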

Re: How to Parse & flatten JSON object in a text file using Spark & Scala into Dataframe

2015-12-23 Thread Eran Witkon
Did you get a solution for this? On Tue, 22 Dec 2015 at 20:24 raja kbv wrote: > Hi, > > I am new to spark. > > I have a text file with below structure. > > > (employeeID: Int, Name: String, ProjectDetails: JsonObject{[{ProjectName, > Description, Duriation, Role}]}) >

Re: Using IntelliJ for Spark development

2015-12-23 Thread Eran Witkon
here to have a better > understanding > http://stackoverflow.com/questions/3589562/why-maven-what-are-the-benefits > > Thanks > Best Regards > > On Wed, Dec 23, 2015 at 4:27 PM, Eran Witkon <eranwit...@gmail.com> wrote: > >> Thanks, all of these examples shows how

Re: Using IntelliJ for Spark development

2015-12-23 Thread Eran Witkon
anks > Best Regards > > On Mon, Dec 21, 2015 at 10:51 PM, Eran Witkon <eranwit...@gmail.com> > wrote: > >> Any pointers on how to use IntelliJ for Spark development? >> Any way to use a Scala worksheet to run like spark-shell? >> > >

Re: fishing for help!

2015-12-21 Thread Eran Witkon
; >> look for differences: packages versions, cpu/network/memory diff etc etc >> >> >> On 21 December 2015 at 14:53, Eran Witkon <eranwit...@gmail.com> wrote: >> >>> Hi, >>> I know it is a wide question but can you think of reasons why a pyspark

fishing for help!

2015-12-21 Thread Eran Witkon
Hi, I know it is a wide question, but can you think of reasons why a pyspark job which runs from server 1 as user 1 will run faster than the same job running on server 2 as user 1? Eran

Using IntelliJ for Spark development

2015-12-21 Thread Eran Witkon
Any pointers on how to use IntelliJ for Spark development? Any way to use a Scala worksheet to run like spark-shell?
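One common route (not spelled out in this thread) is a plain sbt project that IntelliJ's Scala plugin can import directly; a minimal build.sbt, assuming Spark 1.5.2 on Scala 2.10 — mark the dependencies "provided" instead if you only run via spark-submit:

  name := "spark-sandbox"
  version := "0.1"
  scalaVersion := "2.10.5"

  // Full (non-provided) dependencies so a main() can be run straight from the IDE.
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "1.5.2",
    "org.apache.spark" %% "spark-sql"  % "1.5.2"
  )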

Re: spark 1.5.2 memory leak? reading JSON

2015-12-20 Thread Eran Witkon
ption if there is any record that it cannot parse. > Instead, it will put the entire record to the column of "_corrupt_record". > > Thanks, > > Yin > > On Sun, Dec 20, 2015 at 9:37 AM, Eran Witkon <eranwit...@gmail.com> wrote: > >> Thanks for this! >>
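A short sketch of how the _corrupt_record column described above can be used to split good and bad rows (DataFrame and path names are illustrative); note the column only appears when the reader actually hit unparseable lines:

  val df = sqlContext.read.json("sample.json")
  // Malformed lines end up with their raw text in _corrupt_record and nulls elsewhere.
  val bad  = df.filter(df("_corrupt_record").isNotNull)
  val good = df.filter(df("_corrupt_record").isNull).drop("_corrupt_record")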

Re: DataFrame operations

2015-12-20 Thread Eran Witkon
Problem resolved, syntax issue )-: On Mon, 21 Dec 2015 at 06:09 Jeff Zhang <zjf...@gmail.com> wrote: > If it does not return a column you expect, then what does this return ? Do > you will have 2 columns with the same column name ? > > On Sun, Dec 20, 2015 at 7:40 PM, Er

DataFrame operations

2015-12-20 Thread Eran Witkon
Hi, I am a bit confused by DataFrame operations. I have a function which takes a string and returns a string. I want to apply this function to all rows of a single column in my dataframe. I was thinking of the following: jsonData.withColumn("computedField",computeString(jsonData("hse"))) BUT
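A hedged sketch of the usual fix: withColumn expects a Column expression, so the String => String function has to be wrapped as a UDF first (computeString's body below is a stand-in, and "hse" is the column name from the question):

  import org.apache.spark.sql.functions.udf

  def computeString(s: String): String = s.trim.toLowerCase  // placeholder logic
  val computeStringUdf = udf(computeString _)

  val result = jsonData.withColumn("computedField", computeStringUdf(jsonData("hse")))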

Re: combining multiple JSON files to one DataFrame

2015-12-20 Thread Eran Witkon
disregard my last question - my mistake. I accessed it as a col not as a row : jsonData.first.getAs[String]("cty") Eran On Sun, Dec 20, 2015 at 11:42 AM Eran Witkon <eranwit...@gmail.com> wrote: > Thanks, That's works. > One other thing - > I have the followi

Re: combining multiple JSON files to one DataFrame

2015-12-20 Thread Eran Witkon
ova...@gmail.com> wrote: > Just point loader to the folder. You do not need * > On Dec 19, 2015 11:21 PM, "Eran Witkon" <eranwit...@gmail.com> wrote: > >> Hi, >> Can I combine multiple JSON files to one DataFrame? >> >> I tried >> val df = sql
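A minimal sketch of the suggestion above — point the reader at the directory itself rather than a wildcard (path taken from the original question):

  // Reads every JSON file in the folder into a single DataFrame; no trailing /* needed.
  val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample")
  df.count()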

Re: error: not found: value StructType on 1.5.2

2015-12-20 Thread Eran Witkon
rmail > > On December 20, 2015 at 21:43:42, Eran Witkon (eranwit...@gmail.com) > wrote: > > Hi, > I am using spark-shell with version 1.5.2. > scala> sc.version > res17: String = 1.5.2 > > but when trying to use StructType I am getting error: > val struct = >

error: not found: value StructType on 1.5.2

2015-12-20 Thread Eran Witkon
Hi, I am using spark-shell with version 1.5.2. scala> sc.version res17: String = 1.5.2 but when trying to use StructType I am getting an error: val struct = StructType( StructField("a", IntegerType, true) :: StructField("b", LongType, false) :: StructField("c", BooleanType, false) ::
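The StructType/StructField/*Type classes live in org.apache.spark.sql.types, which spark-shell does not import by default; a minimal sketch completing the snippet above:

  import org.apache.spark.sql.types._

  val struct = StructType(
    StructField("a", IntegerType, true) ::
    StructField("b", LongType, false) ::
    StructField("c", BooleanType, false) :: Nil)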

How to convert an RDD to DF?

2015-12-20 Thread Eran Witkon
Hi, I have an RDD jsonGzip res3: org.apache.spark.rdd.RDD[(String, String, String, String)] = MapPartitionsRDD[8] at map at :65 which I want to convert to a DataFrame with schema so I created a schema: val schema = StructType( StructField("cty", StringType, false) ::
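A hedged sketch of one way to finish the conversion: map the tuples to Rows and pass the schema to createDataFrame (only the "cty" field is shown in the question; the remaining field names are assumed from the earlier gzip/JSON thread):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types._

  val schema = StructType(
    StructField("cty",  StringType, false) ::
    StructField("gzip", StringType, false) ::
    StructField("nm",   StringType, false) ::
    StructField("yrs",  StringType, false) :: Nil)

  // Turn each 4-tuple into a Row so it matches the schema, then build the DataFrame.
  val rowRDD = jsonGzip.map { case (cty, gzip, nm, yrs) => Row(cty, gzip, nm, yrs) }
  val df = sqlContext.createDataFrame(rowRDD, schema)
  df.printSchema()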

Re: How to convert an RDD to DF?

2015-12-20 Thread Eran Witkon
examples/src/main/resources/people.txt").map( >* _.split(",")).map(p => Row(p(0), p(1).trim.toInt)) >* val dataFrame = sqlContext.createDataFrame(people, schema) >* dataFrame.printSchema >* // root >* // |-- name: string (nullable = false) >

Re: spark 1.5.2 memory leak? reading JSON

2015-12-20 Thread Eran Witkon
lar problem exists for XML, btw. there's lots of wonky workarounds > for this that use MapPartitions and all kinds of craziness. the best > option, in my opinion, is to just ETL/flatten the data to make the > DataFrame reader happy. > > On Dec 19, 2015, at 4:55 PM, Eran Witkon

Re: How to convert an RDD to DF?

2015-12-20 Thread Eran Witkon
Got it to work, thanks On Sun, 20 Dec 2015 at 17:01 Eran Witkon <eranwit...@gmail.com> wrote: > I might be missing your point but I don't get it. > My understanding is that I need an RDD containing Rows but how do I get it? > > I started with a DataFrame > run a map on it an

spark 1.5.2 memory leak? reading JSON

2015-12-19 Thread Eran Witkon
Hi, I tried the following code in spark-shell on spark 1.5.2: val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample/sample2.json") df.count() 15/12/19 23:49:40 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 3 15/12/19 23:49:40 ERROR Executor: Exception

Re: Using Spark to process JSON with gzip field

2015-12-19 Thread Eran Witkon
ressedData.take(count) > while (count > 0) { > count = inflater.inflate(decompressedData) > finalData = finalData ++ decompressedData.take(count) > } > new String(finalData) > }) > > > > > Thanks > Best Regards >

combining multiple JSON files to one DataFrame

2015-12-19 Thread Eran Witkon
Hi, Can I combine multiple JSON files to one DataFrame? I tried val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample/*") but I get an empty DF Eran

Can't run spark on yarn

2015-12-17 Thread Eran Witkon
Hi, I am trying to install Spark 1.5.2 on Apache Hadoop 2.6 with Hive and YARN. spark-env.sh: export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop bash_profile: #HADOOP VARIABLES START export JAVA_HOME=/usr/lib/jvm/java-8-oracle/ export HADOOP_INSTALL=/usr/local/hadoop export

wholeTextFiles for ~8000 files - problem

2015-12-16 Thread Eran Witkon
Hi, I have about 8K files in about 10 directories on HDFS and I need to add a column to all files with the file name (e.g. file1.txt adds a column with "file1.txt", file2.txt with "file2.txt", etc.). The current approach was to read all files using sc.wholeTextFiles("myPath") and have the file name as
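A hedged sketch of the wholeTextFiles route — each record is a (path, content) pair, so the file name can be prefixed to every line (the tab delimiter and HDFS path are illustrative; note each file is held in memory whole, which matters at 8K files):

  // wholeTextFiles yields (path, fileContent); derive the file name and tag every line with it.
  val withName = sc.wholeTextFiles("hdfs:///myPath/*")
    .flatMap { case (path, content) =>
      val fileName = path.split("/").last
      content.split("\n").map(line => s"$fileName\t$line")
    }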

Using Spark to process JSON with gzip field

2015-12-16 Thread Eran Witkon
Hi, I have a few JSON files in which one of the fields is a binary field - this field is the output of running GZIP on a JSON stream and compressing it into the binary field. Now I want to decompress the field and get the output JSON. I was thinking of running a map operation and passing a function

Re: Spark big rdd problem

2015-12-16 Thread Eran Witkon
at 08:27 Eran Witkon <eranwit...@gmail.com> wrote: > But what if I don't have more memory? > On Wed, 16 Dec 2015 at 08:13 Zhan Zhang <zzh...@hortonworks.com> wrote: > >> There are two cases here. If the container is killed by yarn, you can >> increase jvm overhead
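A hedged sketch of the two knobs mentioned above - executor memory and the YARN memory overhead - set on the SparkConf before the context is created (values are illustrative; the same settings can be passed with --conf on spark-submit):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("wholeTextFiles-job")
    .set("spark.executor.memory", "5g")                    // per-executor heap
    .set("spark.yarn.executor.memoryOverhead", "1024")     // extra MB YARN reserves per executor
  val sc = new SparkContext(conf)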

Spark big rdd problem

2015-12-15 Thread Eran Witkon
When running val data = sc.wholeTextFiles("someDir/*") data.count() I get numerous warnings from YARN till I get an Akka association exception. Can someone explain what happens when Spark loads this RDD and can't fit it all in memory? Based on the exception it looks like the server is disconnecting from

Re: Spark big rdd problem

2015-12-15 Thread Eran Witkon
xxx”, where you can possible find the cause. > > Thanks. > > Zhan Zhang > > On Dec 15, 2015, at 11:50 AM, Eran Witkon <eranwit...@gmail.com> wrote: > > > When running > > val data = sc.wholeTextFile("someDir/*") data.count() > > > > I get

Re: Spark big rdd problem

2015-12-15 Thread Eran Witkon
leak happening. > > Thanks. > > Zhan Zhang > > On Dec 15, 2015, at 9:58 PM, Eran Witkon <eranwit...@gmail.com> wrote: > > If the problem is containers trying to use more memory then they allowed, > how do I limit them? I all ready have executor-memory 5G >