Can you name the features that make Databricks better than Zeppelin?
Eran
On Fri, 29 Jan 2016 at 01:37 Michal Klos wrote:
> We use both databricks and emr. We use databricks for our exploratory /
> adhoc use cases because their notebook is pretty badass and better than
>
hange.
> DF.withColumn("upper-code",upper(df("countrycode"))).
>
> This creates a new column "upper-code". Is there a way to update the
> column, or to create a new df with an updated column?
>
> Thanks,
> Raja
>
> On Thursday, 24 December 2015 6:17 P
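A minimal sketch of the in-place update, assuming df is the DataFrame in question: passing an existing column name to withColumn replaces that column instead of adding a new one.

import org.apache.spark.sql.functions.upper

// reusing the existing name overwrites the column rather than adding "upper-code"
val updated = df.withColumn("countrycode", upper(df("countrycode")))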
If you drop the other columns (or map to a new df with only that column) and
call collect, I think you will get what you want.
On Fri, 25 Dec 2015 at 10:26 fightf...@163.com wrote:
> Emm...I think you can do a df.map and store each column value to your list.
>
>
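A minimal sketch of both suggestions, assuming a string column; the column name here is an assumption:

// keep only the one column, then bring its values back to the driver as a local array
val values = df.select("countrycode").map(_.getString(0)).collect()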
Hi,
I have a JSON file with the following row format:
{"cty":"United
Kingdom","gzip":"H4sIAKtWystVslJQcs4rLVHSUUouqQTxQvMyS1JTFLwz89JT8nOB4hnFqSBxj/zS4lSF/DQFl9S83MSibKBMZVExSMbQwNBM19DA2FSpFgDvJUGVUw==","nm":"Edmund
lronside","yrs":"1016"}
The gzip field is a compressed JSON. Decompress it and substitute the result
back into the JSON string representation of the whole
line, and you have a nested JSON schema which SparkSQL can read.
Eran
On Thu, Dec 24, 2015 at 10:26 AM Eran Witkon <eranwit...@gmail.com> wrote:
> I don't have the exact answer for you, but I would look for something using
> the explode method
"nm": \"$nm\", "yrs": \"$yrs\"}"""})
See this link for source
http://stackoverflow.com/questions/34069282/how-to-query-json-data-column-using-spark-dataframes
Eran
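A sketch of the idea from that link, assuming df was read from the JSON file above and that decompress is a hypothetical gzip helper: rebuild each row as one JSON string with the decompressed payload inlined, then let SparkSQL infer the nested schema.

// build a full JSON string per row, inlining the decompressed gzip payload
val jsonStrings = df.map { r =>
  val cty = r.getAs[String]("cty")
  val nm = r.getAs[String]("nm")
  val yrs = r.getAs[String]("yrs")
  val gz = decompress(r.getAs[String]("gzip")) // hypothetical helper
  s"""{"cty": "$cty", "gzip": $gz, "nm": "$nm", "yrs": "$yrs"}"""
}
// read.json accepts an RDD[String] holding one JSON object per element
val nested = sqlContext.read.json(jsonStrings)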
On Thu, Dec 24, 2015 at 11:42 AM Eran Witkon <eranwit...@gmail.
Use DF.withColumn("upper-code", upper(df("countrycode")))
or just run a map function that does the same
On Thu, Dec 24, 2015 at 2:05 PM Bharathi Raja
wrote:
> Hi,
> Values in a dataframe column named countrycode are in different cases. Eg:
> (US, us). groupBy & count
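A sketch of the normalize-then-aggregate answer, assuming the goal is a case-insensitive count:

import org.apache.spark.sql.functions.upper

// fold "US" and "us" into one key before grouping
val counts = df.withColumn("countrycode", upper(df("countrycode")))
  .groupBy("countrycode")
  .count()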
Did you get a solution for this?
On Tue, 22 Dec 2015 at 20:24 raja kbv wrote:
> Hi,
>
> I am new to spark.
>
> I have a text file with below structure.
>
>
> (employeeID: Int, Name: String, ProjectDetails: JsonObject{[{ProjectName,
> Description, Duration, Role}]})
>
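One possible sketch, under the assumption that each line is two comma-separated fields followed by the JSON object; the path and the exact line layout are assumptions:

val raw = sc.textFile("employees.txt") // hypothetical path
val asJson = raw.map { line =>
  // split into at most 3 parts so commas inside the JSON survive
  val Array(id, name, json) = line.split(",", 3)
  s"""{"employeeID": $id, "Name": "$name", "ProjectDetails": $json}"""
}
val df = sqlContext.read.json(asJson)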
here to have a better
> understanding
> http://stackoverflow.com/questions/3589562/why-maven-what-are-the-benefits
>
> Thanks
> Best Regards
>
> On Wed, Dec 23, 2015 at 4:27 PM, Eran Witkon <eranwit...@gmail.com> wrote:
>
>> Thanks, all of these examples show how
Thanks
> Best Regards
>
> On Mon, Dec 21, 2015 at 10:51 PM, Eran Witkon <eranwit...@gmail.com>
> wrote:
>
>> Any pointers on how to use IntelliJ for Spark development?
>> Any way to use a Scala worksheet to run like spark-shell?
>>
>
>
;
>> look for differences: packages versions, cpu/network/memory diff etc etc
>>
>>
>> On 21 December 2015 at 14:53, Eran Witkon <eranwit...@gmail.com> wrote:
>>
>>> Hi,
>>> I know it is a wide question but can you think of reasons why a pyspark
Hi,
I know it is a wide question but can you think of reasons why a pyspark job
which runs from server 1 using user 1 will run faster than the same job
when running on server 2 with user 1?
Eran
Any pointers on how to use IntelliJ for Spark development?
Any way to use a Scala worksheet to run like spark-shell?
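For what it's worth, a minimal build.sbt that IntelliJ's sbt import can pick up (the versions are assumptions matching the 1.5.2 shell used elsewhere in the thread); any object with a main method then runs as a local driver:

name := "spark-sandbox"
scalaVersion := "2.10.5"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.2"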
the JSON data source will not throw an exception if there is any record that it cannot parse.
> Instead, it will put the entire record into the "_corrupt_record" column.
>
> Thanks,
>
> Yin
>
> On Sun, Dec 20, 2015 at 9:37 AM, Eran Witkon <eranwit...@gmail.com> wrote:
>
>> Thanks for this!
>>
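A sketch of inspecting those records afterwards, with a hypothetical input path; note the column only exists if at least one record failed to parse:

val df = sqlContext.read.json("input.json") // hypothetical path
// rows that failed to parse keep their raw text in _corrupt_record
val bad = df.filter(df("_corrupt_record").isNotNull)
val good = df.filter(df("_corrupt_record").isNull)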
Problem resolved, syntax issue )-:
On Mon, 21 Dec 2015 at 06:09 Jeff Zhang <zjf...@gmail.com> wrote:
> If it does not return the column you expect, then what does this return? Will
> you have 2 columns with the same column name?
>
> On Sun, Dec 20, 2015 at 7:40 PM, Er
Hi,
I am a bit confused with dataframe operations.
I have a function which takes a string and returns a string.
I want to apply this function to all rows of a single column in my
dataframe.
I was thinking of the following:
jsonData.withColumn("computedField",computeString(jsonData("hse")))
BUT
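For the withColumn call above, a plain Scala function has to be wrapped in a udf first; a minimal sketch, with a placeholder body for computeString:

import org.apache.spark.sql.functions.udf

def computeString(s: String): String = s.reverse // placeholder logic
val computeStringUdf = udf(computeString _)

val result = jsonData.withColumn("computedField", computeStringUdf(jsonData("hse")))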
disregard my last question - my mistake.
I accessed it as a column, not as a row:
jsonData.first.getAs[String]("cty")
Eran
On Sun, Dec 20, 2015 at 11:42 AM Eran Witkon <eranwit...@gmail.com> wrote:
> Thanks, that works.
> One other thing -
> I have the followi
ova...@gmail.com>
wrote:
> Just point the loader to the folder. You do not need *
> On Dec 19, 2015 11:21 PM, "Eran Witkon" <eranwit...@gmail.com> wrote:
>
>> Hi,
>> Can I combine multiple JSON files to one DataFrame?
>>
>> I tried
>> val df = sql
>
> On December 20, 2015 at 21:43:42, Eran Witkon (eranwit...@gmail.com)
> wrote:
>
> Hi,
> I am using spark-shell with version 1.5.2.
> scala> sc.version
> res17: String = 1.5.2
>
> but when trying to use StructType I am getting an error:
> val struct =
>
Hi,
I am using spark-shell with version 1.5.2.
scala> sc.version
res17: String = 1.5.2
but when trying to use StructType I am getting an error:
val struct =
StructType(
StructField("a", IntegerType, true) ::
StructField("b", LongType, false) ::
StructField("c", BooleanType, false) ::
Hi,
I have an RDD
jsonGzip
res3: org.apache.spark.rdd.RDD[(String, String, String, String)] =
MapPartitionsRDD[8] at map at :65
which I want to convert to a DataFrame with schema
so I created a schema:
val schema =
StructType(
StructField("cty", StringType, false) ::
> val people = sc.textFile("examples/src/main/resources/people.txt").map(
>   _.split(",")).map(p => Row(p(0), p(1).trim.toInt))
> val dataFrame = sqlContext.createDataFrame(people, schema)
> dataFrame.printSchema
> // root
> // |-- name: string (nullable = false)
>
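Tying that back to the 4-string tuple RDD above, a minimal sketch; the field order and the last three field names are assumptions based on the sample row:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(
  StructField("cty", StringType, false) ::
  StructField("gzip", StringType, false) ::
  StructField("nm", StringType, false) ::
  StructField("yrs", StringType, false) :: Nil)

// wrap each tuple in a Row so createDataFrame can apply the schema
val rowRDD = jsonGzip.map { case (cty, gzip, nm, yrs) => Row(cty, gzip, nm, yrs) }
val dataFrame = sqlContext.createDataFrame(rowRDD, schema)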
A similar problem exists for XML, btw. There are lots of wonky workarounds
> for this that use MapPartitions and all kinds of craziness. the best
> option, in my opinion, is to just ETL/flatten the data to make the
> DataFrame reader happy.
>
> On Dec 19, 2015, at 4:55 PM, Eran Witkon
Got it to work, thanks
On Sun, 20 Dec 2015 at 17:01 Eran Witkon <eranwit...@gmail.com> wrote:
> I might be missing your point but I don't get it.
> My understanding is that I need a RDD containing Rows but how do I get it?
>
> I started with a DataFrame
> run a map on it an
Hi,
I tried the following code in spark-shell on Spark 1.5.2:
val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample/sample2.json")
df.count()
15/12/19 23:49:40 ERROR Executor: Managed memory leak detected; size =
67108864 bytes, TID = 3
15/12/19 23:49:40 ERROR Executor: Exception
ressedData.take(count)
> while (count > 0) {
> count = inflater.inflate(decompressedData)
> finalData = finalData ++ decompressedData.take(count)
> }
> new String(finalData)
> })
>
>
>
>
> Thanks
> Best Regards
>
Hi,
Can I combine multiple JSON files to one DataFrame?
I tried
val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample/*")
but I get an empty DF
Eran
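Per the reply above, dropping the wildcard and pointing the reader at the directory itself should work:

val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample")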
Hi,
I am trying to install Spark 1.5.2 on Apache Hadoop 2.6 with Hive and YARN
spark-env.sh
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
bash_profile
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
export HADOOP_INSTALL=/usr/local/hadoop
export
Hi,
I have about 8K files in about 10 directories on HDFS and I need to add a
column to all files with the file name (e.g. file1.txt adds a column with
"file1.txt", file2.txt with "file2.txt", etc.)
The current approach was to read all files using sc.wholeTextFiles("myPath")
and have the file name as
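A sketch of that approach: wholeTextFiles yields (path, content) pairs, so the file name can be carried onto every line.

// each element is (fullPath, fileContent); split the content back into
// lines and tag each line with its source file
val withName = sc.wholeTextFiles("myPath")
  .flatMap { case (path, content) =>
    content.split("\n").map(line => (path, line))
  }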
Hi,
I have a few JSON files in which one of the fields is a binary field - this
field is the output of running GZIP on a JSON stream and compressing it into
the binary field.
Now I want to de-compress the field and get the output JSON.
I was thinking of running a map operation and passing a function
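A sketch of such a function, assuming the binary field arrives as Base64 text (the "H4sI" prefix in the sample row is what Base64-encoded gzip data looks like) and a Java 8 runtime; it uses GZIPInputStream rather than a raw Inflater:

import java.io.ByteArrayInputStream
import java.util.Base64
import java.util.zip.GZIPInputStream
import scala.io.Source

def decompress(b64: String): String = {
  val bytes = Base64.getDecoder.decode(b64)
  val in = new GZIPInputStream(new ByteArrayInputStream(bytes))
  try Source.fromInputStream(in, "UTF-8").mkString finally in.close()
}

// apply it to every row of the DataFrame
val decompressed = df.map(r => decompress(r.getAs[String]("gzip")))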
at 08:27 Eran Witkon <eranwit...@gmail.com> wrote:
> But what if I don't have more memory?
> On Wed, 16 Dec 2015 at 08:13 Zhan Zhang <zzh...@hortonworks.com> wrote:
>
>> There are two cases here. If the container is killed by yarn, you can
>> increase jvm overhead
When running
val data = sc.wholeTextFiles("someDir/*"); data.count()
I get numerous warnings from YARN until I get an Akka association exception.
Can someone explain what happens when Spark loads this RDD and can't fit it
all in memory?
Based on the exception it looks like the server is disconnecting from
xxx”, where you can possibly find the cause.
>
> Thanks.
>
> Zhan Zhang
>
> On Dec 15, 2015, at 11:50 AM, Eran Witkon <eranwit...@gmail.com> wrote:
>
> > When running
> > val data = sc.wholeTextFiles("someDir/*"); data.count()
> >
> > I get
leak happening.
>
> Thanks.
>
> Zhan Zhang
>
> On Dec 15, 2015, at 9:58 PM, Eran Witkon <eranwit...@gmail.com> wrote:
>
> If the problem is containers trying to use more memory than they are allowed,
> how do I limit them? I already have executor-memory 5G
>
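For reference, the YARN overhead knob mentioned here is presumably spark.yarn.executor.memoryOverhead (in MB); a sketch of passing it alongside the existing executor memory, with an example value:

spark-shell --executor-memory 5G --conf spark.yarn.executor.memoryOverhead=1024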