PST or EST?
> On Jan 19, 2017, at 21:55, ayan guha wrote:
>
> Sure...we will wait :) :)
>
> Just kidding
>
>> On Fri, Jan 20, 2017 at 4:48 PM, Manohar753 wrote:
You could try Skymind. It is Java-based; I never tested it, though.
> On Sep 30, 2016, at 7:30 PM, janardhan shetty wrote:
>
> Hi,
>
> Are there any good libraries which can be used for Scala deep learning models?
> How can we integrate TensorFlow with Scala ML?
Any shuffling?
> On Sep 3, 2016, at 5:50 AM, Сергей Романов wrote:
>
> The same problem happens with a CSV data file, so it's not Parquet-related either.
>
>
> [Spark shell ASCII welcome banner; version string truncated]
I tried both M4 and R3. R3 is slightly more expensive, but has more
memory.
If you are doing a lot of in-memory stuff, like joins, I recommend R3.
Otherwise M4 is fine. Also, remember M4 is EBS-only, so you have to
pay the additional EBS cost as well.
On Fri, Aug 26, 2016 at 10:29 AM, Sa
I am trying to output an RDD to disk with:
rdd.coalesce(1).saveAsTextFile("/foo")
It outputs to the foo folder with a file named Part-0.
Is there a way I could directly save the file as /foo/somename?
Thanks.
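For reference, a sketch of one common workaround, not from the thread: write to a temporary directory, then rename the single part file through the Hadoop FileSystem API (paths are hypothetical, and sc is assumed to be the SparkContext).

    import org.apache.hadoop.fs.{FileSystem, Path}

    // write the single partition to a temporary directory
    rdd.coalesce(1).saveAsTextFile("/foo_tmp")

    val fs = FileSystem.get(sc.hadoopConfiguration)
    fs.mkdirs(new Path("/foo"))
    // saveAsTextFile names its single output file part-00000
    fs.rename(new Path("/foo_tmp/part-00000"), new Path("/foo/somename"))
    fs.delete(new Path("/foo_tmp"), true)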
This is useful:)
Thank you for sharing.
> On Jul 29, 2016, at 1:30 PM, Jean Georges Perrin wrote:
>
> Sorry if this looks like shameless self-promotion, but some of you asked me
> to say when I'd have my Java recipes for Apache Spark updated. It's done
> here: http://jgp.net/2016/07/22
>
> http://talebzadehmich.wordpress.com
>
>
>> On 11 June 2016 at 22:26, Gavin Yue wrote:
>> Standalone mode is an alternative to YARN or Mesos mode; in those modes Spark
>> uses YARN or Mesos as the cluster manager.
>>
>>
Standalone mode is an alternative to YARN or Mesos mode; in those modes Spark
uses YARN or Mesos as the cluster manager.
Local mode is actually a standalone deployment in which everything runs on a
single local machine instead of a remote cluster.
That is my understanding.
On Sat, Jun 11, 2016 at 12:40 PM, wrote:
> Hi,
>
> I think if we try to see why Query 2 is faster than Query 1, then all the
> answers will be given without beating around the bush. That is the right
> way to find out what is happening and why.
>
>
> Regards,
> Gourav
>
> On Thu, Jun 9, 2016 at 11:19
Could you print out the SQL execution plan? My guess is that it is a broadcast join.
> On Jun 9, 2016, at 07:14, Gourav Sengupta wrote:
>
> Hi,
>
> Query 1 is almost 25x faster in Hive than in Spark. What is happening here, and
> is there a way we can optimize the queries in Spark without the obviou
If you are not reading the whole dataset, how do you know the total number of records?
And without knowing the total number, how do you choose 30%?
> On May 31, 2016, at 00:45, pbaier wrote:
>
> Hi all,
>
> I have the following use case:
> I have around 10k of jsons that I want to use for learning.
> The json
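A sketch of what sampling without a total count could look like, assuming the goal is simply "roughly 30% of the records": DataFrame.sample takes a fraction, so no total is needed (the path is hypothetical).

    val sample = sqlContext.read.json("/path/to/jsons")
      .sample(withReplacement = false, fraction = 0.3)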
For log files I would suggest saving as gzipped text first. After
aggregation, convert them into Parquet by merging a few files.
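A minimal sketch of that flow, with hypothetical paths and assuming the logs are JSON lines:

    // gzipped text is read transparently
    val logs = sc.textFile("/logs/2016-05-19/*.gz")

    // parse into a DataFrame
    val df = sqlContext.read.json(logs)

    // merge into a few larger Parquet files
    df.coalesce(8).write.parquet("/warehouse/logs/2016-05-19")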
> On May 19, 2016, at 22:32, Deng Ching-Mallete wrote:
>
> IMO, it might be better to merge or compact the parquet files instead of
> keeping lots of small fil
Hey,
I want to try NLP on Spark. Could anyone recommend an easy-to-run open
source NLP library for Spark?
Also, is there any recommended semantic network?
Thanks a lot.
N-tiers or layers are mainly for separating a big problem into smaller
pieces, so the idea is always valid.
It just means different things for different applications.
Speaking of offline analytics and the big data eco-world, there are numerous
ways of slicing the problem into tiers/layers. Yo
It is a separate project, as far as I understand. I am evaluating it right
now.
> On Mar 29, 2016, at 16:17, Michael Segel wrote:
>
>
>
>> Begin forwarded message:
>>
>> From: Michael Segel
>> Subject: Re: Spark and N-tier architecture
>> Date: March 29, 2016 at 4:16:44 PM MS
I recommend you provide more information. Using an inverted index certainly
speeds up query time when the index is hit, but it takes longer to create and
insert.
Is the source code not available at this moment?
Thanks
Gavin
> On Feb 22, 2016, at 20:27, 开心延年 wrote:
>
> if apache enjo
This sqlContext is an instance of HiveContext; do not be confused by the
name.
> On Feb 16, 2016, at 12:51, Prabhu Joseph wrote:
>
> Hi All,
>
> On creating HiveContext in spark-shell, fails with
>
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the
> datab
Found the answer. It is the block size.
Thanks.
On Wed, Feb 3, 2016 at 5:05 PM, Gavin Yue wrote:
> I am doing a simple count like:
>
> sqlContext.read.parquet("path").count
>
> I have only 5000 parquet files, but the count generates over 2 tasks.
>
> Each parquet fil
I am doing a simple count like:
sqlContext.read.parquet("path").count
I have only 5000 parquet files, but the count generates over 2 tasks.
Each parquet file is converted from one gz text file.
Please give some advice.
Thanks
Has anyone used Ignite in a production system?
On Mon, Jan 11, 2016 at 11:44 PM, Jörn Franke wrote:
> You can look at ignite as a HDFS cache or for storing rdds.
>
> > On 11 Jan 2016, at 21:14, Dmitry Goldenberg wrote:
> >
> > We have a bunch of Spark jobs deployed and a few large resource f
wrote:
> Could you use "coalesce" to reduce the number of partitions?
>
>
> Shixiong Zhu
>
>
> On Mon, Jan 11, 2016 at 12:21 AM, Gavin Yue wrote:
>
>> Here is more info.
>>
>> The job stuck at:
>> INFO cluster.YarnScheduler: Adding task
u saw "3000 jobs" failed. Were you writing each Parquet
> file with an individual job? (Usually people use
> write.partitionBy(...).parquet(...) to write multiple Parquet files.)
>
> Cheng
>
>
> On 1/10/16 10:12 PM, Gavin Yue wrote:
>
>> Hey,
>>
>> I
spark.network.timeout from 120s to 600s. It sometimes
works.
Each task is a parquet file. I could not repartition due to GC out-of-memory
problems.
Is there any way I could improve the performance?
Thanks,
Gavin
On Sun, Jan 10, 2016 at 1:51 AM, Gavin Yue wrote:
> Hey,
>
> I have 10 days data, each
Hey,
I am trying to convert a bunch of JSON files into Parquet, which would
output over 7000 Parquet files. But there are too many files, so I want
to repartition by id down to 3000.
But I got the error of GC problem like this one:
https://mail-archives.apache.org/mod_mbox/spark-user/201512.mb
Hey,
I have 10 days of data; each day has a Parquet directory with over 7000
partitions.
When I union the 10 days and do a count, it submits over 70K tasks.
The job then fails silently with one container exiting with code 1. The
union with 5 or 6 days of data is fine.
In the spark-shell, it just
So I tried to set the Parquet compression codec to lzo, but Hadoop does not
have the lzo natives, while lz4 is included.
But I could not set the codec to lz4; it only accepts lzo.
Any solution here?
Thanks,
Gavin
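For reference, the codec is chosen through a Spark SQL conf; a one-line sketch, with snappy shown only as an assumed working alternative:

    sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")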
On Sat, Jan 9, 2016 at 12:09 AM, Gavin Yue wrote:
> I saw in the document,
aster than GZIP and smaller than Snappy.
>
> Cheers
>
> On Fri, Jan 8, 2016 at 7:56 PM, Gavin Yue wrote:
>
>> Thank you .
>>
>> And speaking of compression, is there a big difference in performance
>> between gzip and snappy? And why is parquet using gzip by defau
vin:
> Which release of HBase did you play with?
>
> HBase has been evolving and is getting more stable.
>
> Cheers
>
> On Fri, Jan 8, 2016 at 6:29 PM, Gavin Yue wrote:
>
>> I used to maintain an HBase cluster. The experience with it was not a happy one.
>>
>> I just
ber of partitions to see if the
>> event has already come (and no guarantee it is foolproof, but it leads to
>> unnecessary loading in most cases).
>>
>> On Sat, Jan 9, 2016 at 12:56 PM, Gavin Yue wrote:
>>
>>> Hey,
>>> Thank you for the answer. I
s small, spark can
> get it done faster than having everything in one run.
> - I think using groupBy (userId, timestamp) might be better than
> distinct. I guess distinct() will compare every field.
>
>
> On Fri, Jan 8, 2016 at 2:31 PM, Gavin Yue wrote:
>
And the most frequent operation I am going to do is find the UserIDs that
have certain events, then retrieve all the events associated with each UserID.
In this case, how should I partition to speed up the process?
Thanks.
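A sketch of one approach, not from the thread: hash-partition the pair RDD by UserID so each user's events land in one partition, then use lookup, which scans only the matching partition (events: RDD[(String, String)] is an assumed shape).

    import org.apache.spark.HashPartitioner

    val partitioned = events.partitionBy(new HashPartitioner(3000)).cache()

    // lookup uses the partitioner, so it reads a single partition
    val userEvents = partitioned.lookup("someUserId")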
On Fri, Jan 8, 2016 at 2:29 PM, Gavin Yue wrote:
> hey Ted,
>
> Event
by date?
>
> Can you dedup within partitions?
>
> Cheers
>
> On Fri, Jan 8, 2016 at 2:10 PM, Gavin Yue wrote:
>
>> I tried on three days' data. The total input is only 980GB, but the
>> shuffle write data is about 6.2TB, and the job failed during shuff
this process?
Thanks.
On Fri, Jan 8, 2016 at 2:04 PM, Gavin Yue wrote:
> Hey,
>
> I get each day's Event table and want to merge them into a single Event
> table. But there are so many duplicates among each day's data.
>
> I use Parquet as the data source. What I am do
Hey,
I get each day's Event table and want to merge them into a single Event
table. But there are so many duplicates among each day's data.
I use Parquet as the data source. What I am doing now is
EventDay1.unionAll(EventDay2).distinct().write.parquet("a new parquet
file").
Each day's Event is sto
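A sketch of a cheaper dedup, assuming the events carry key columns such as userId and timestamp (hypothetical names): dropDuplicates compares only the listed columns instead of every field.

    val merged = EventDay1.unionAll(EventDay2)
      .dropDuplicates(Seq("userId", "timestamp"))
    merged.write.parquet("/events/merged")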
I am trying to read JSON files following the example:

val path = "examples/src/main/resources/jsonfile"
val people = sqlContext.read.json(path)

I have 1 TB of files in the path. It took 1.2 hours to finish the
reading, just to infer the schema.
But I already know the schema. Could I make this proces
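A sketch of supplying the schema up front so the inference scan is skipped entirely (field names are hypothetical):

    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("userId", StringType),
      StructField("timestamp", LongType),
      StructField("event", StringType)))

    val people = sqlContext.read.schema(schema).json(path)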
I found that in 1.6 a DataFrame can do repartition.
Do I still need to do an orderBy first, or can I just repartition?
On Tue, Jan 5, 2016 at 9:25 PM, Gavin Yue wrote:
> I tried Ted's solution and it works. But I keep hitting the JVM out-of-memory
> problem.
> And gr
t%20Aggregator.html
>
> On Tue, Jan 5, 2016 at 2:59 PM, Ted Yu wrote:
>
>> Something like the following:
>>
>> val zeroValue = collection.mutable.Set[String]()
>>
>> val aggredated = data.aggregateByKey (zeroValue)((set, v) => set += v,
>> (setOne, setTwo
Hey,
For example, a table df with two columns
id name
1 abc
1 bdf
2 ab
2 cd
I want to group by id and concatenate the names into an array of strings, like
this:
id names
1 [abc, bdf]
2 [ab, cd]
How could I achieve this with a DataFrame? I am stuck at df.groupBy("id"). ???
Thanks
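A sketch completing the aggregateByKey idea quoted above, with the cut-off merge function assumed to combine the two sets, and assuming id is an Int column and name a String column:

    val pairs = df.rdd.map(r => (r.getInt(0), r.getString(1)))

    val zeroValue = collection.mutable.Set[String]()
    val aggregated = pairs.aggregateByKey(zeroValue)(
      (set, v) => set += v,                   // fold a value into a partition-local set
      (setOne, setTwo) => setOne ++= setTwo)  // merge sets across partitions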
I have JSON files which contain timestamped events. Each event is associated
with a user id.
Now I want to group by user id, so it converts from
Event1 -> UserIDA;
Event2 -> UserIDA;
Event3 -> UserIDB;
to intermediate storage:
UserIDA -> (Event1, Event2...)
UserIDB -> (Event3...)
Then I will label po
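A sketch of that grouping with hypothetical field names, assuming the JSON parses to (userId, event) pairs; note groupByKey shuffles every record:

    val events = sqlContext.read.json("/events/*.json")
    val byUser = events.rdd
      .map(r => (r.getAs[String]("userId"), r.getAs[String]("event")))
      .groupByKey()   // UserIDA -> (Event1, Event2, ...)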
utside the
> cluster,
> and the job will run on YARN, similar to how a Hadoop job is already done; could
> you confirm it would work the same way for Spark...
>
> Do you mean that I should print those variables on the Linux command line?
>
> Best Regards,
> Zhiliang
>
>
>
Print out your env variables and check first
> On Sep 25, 2015, at 18:43, Zhiliang Zhu wrote:
>
> Hi All,
>
> I would like to submit a Spark job from another remote machine outside the
> cluster,
> I also copied hadoop/spark conf files under the remote machine, then hado
> Thanks
> Best Regards
>
> On Fri, Sep 25, 2015 at 1:58 AM, Gavin Yue wrote:
>
>> Running Spark app over Yarn 2.7
>>
>> Here is my sparksubmit setting:
>> --master yarn-cluster \
>> --num-executors 100 \
>> --executor-cores 3 \
>> --ex
Running Spark app over Yarn 2.7
Here is my sparksubmit setting:
--master yarn-cluster \
--num-executors 100 \
--executor-cores 3 \
--executor-memory 20g \
--driver-memory 20g \
--driver-cores 2 \
But the executor cores setting is not working. It always assigns only one
vcore to one containe
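One common cause, offered as an assumption rather than a confirmed diagnosis: the YARN capacity scheduler's DefaultResourceCalculator accounts only for memory, so the RM reports one vcore per container no matter what --executor-cores says. The usual fix in capacity-scheduler.xml:

    <property>
      <name>yarn.scheduler.capacity.resource-calculator</name>
      <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
    </property>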
I am trying to parse quite a lot of large JSON files.
At the beginning, I was doing this:
sc.textFile(path).map(parseJson).count()
For each file (800-900 MB), it would take roughly 1 min to finish.
I then changed the code tl
val rawData = sc.textFile(path)
rawData.cache()
rawData.count()
ra
For a large dataset, I want to filter out something and then do the
computing-intensive work.
What I am doing now:
val filtered = data.filter(someRules).cache()
filtered.count()
filtered.map(timeIntensiveCompute)
But this sometimes takes an unusually long time due to cache misses and
recalculation.
So I changed to thi
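A sketch of one mitigation, not from the thread: persist with MEMORY_AND_DISK so evicted blocks spill to disk instead of being recomputed from scratch (names taken from the message above, otherwise hypothetical).

    import org.apache.spark.storage.StorageLevel

    val filtered = data.filter(someRules).persist(StorageLevel.MEMORY_AND_DISK)
    filtered.count()   // materialize once
    val result = filtered.map(timeIntensiveCompute)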
Spark SQL?
>
> On Fri, Aug 28, 2015 at 12:12 PM, Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote:
>
>> I see that you are not reusing the same mapper instance in the Scala
>> snippet.
>>
>> Regards
>> Sab
>>
>> On Fri, Aug
> Regards
> Sab
> On 28-Aug-2015 7:29 am, "Gavin Yue" wrote:
>
>> Hey
>>
>> I am using the Json4s-Jackson parser that comes with Spark and parsing
>> roughly 80m records with a total size of 900 MB.
>>
>> But the speed is slow. It took my 50 node
Hey
I am using the Json4s-Jackson parser that comes with Spark and parsing roughly 80m
records with a total size of 900 MB.
But the speed is slow. It took my 50 nodes (16-core CPU, 100 GB memory) roughly
30 minutes to parse the JSON for use with Spark SQL.
Jackson's benchmarks say parsing should be at the millisecond level.
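A sketch of the mapper-reuse point raised in the reply above, using plain Jackson as an assumption and one ObjectMapper per partition instead of per record (rawData: RDD[String] of JSON lines and the field name are hypothetical):

    import com.fasterxml.jackson.databind.ObjectMapper

    val userIds = rawData.mapPartitions { lines =>
      val mapper = new ObjectMapper()   // built once per partition
      lines.map(line => mapper.readTree(line).path("userId").asText())
    }
    userIds.count()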
Hey,
I have an RDD[(String, Boolean)]. I want to keep all Boolean=true rows and
randomly keep some Boolean=false rows, hoping that in the final result the
negative ones are about 10 times more numerous than the positive ones.
What would be the most efficient way to do this?
Thanks,
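A sketch of one way to do this, not from the thread: sampleByKey keyed on the Boolean flag, with the negative fraction derived from the two counts so negatives come out at roughly ten per positive.

    val posCount = rdd.filter(_._2).count()
    val negCount = rdd.count() - posCount
    val negFraction = math.min(1.0, 10.0 * posCount / negCount)

    val sampled = rdd.map(_.swap)
      .sampleByKey(withReplacement = false, Map(true -> 1.0, false -> negFraction))
      .map(_.swap)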
Hey,
I am testing the StreamingLinearRegressionWithSGD following the tutorial.
It works, but I could not output the prediction results. I tried
saveAsTextFile, but it only outputs _SUCCESS to the folder.
I am trying to check the prediction results and use
BinaryClassificationMetrics to get
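A sketch of one possible cause, offered purely as an assumption: on a DStream the saving method is saveAsTextFiles (plural), which writes one directory per batch, and an empty batch yields only _SUCCESS (the model and testData names are hypothetical).

    // assumes model: StreamingLinearRegressionWithSGD, testData: DStream[LabeledPoint]
    val predictions = model.predictOnValues(testData.map(lp => (lp.label, lp.features)))
    predictions.saveAsTextFiles("/predictions/batch")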
I got the same problem when I upgraded from 1.3.1 to 1.4.
The same conf was used; 1.3 works, but the 1.4 UI does not.
So I added yarn.resourcemanager.webapp.address (host:8088) and
yarn.resourcemanager.hostname to yarn-site.xml. That solved the problem.
Spark 1.4 +
> On Jun 14, 2015, at 02:10, ayan guha wrote:
>
> Can you do the dedup process locally for each file first and then globally?
> Also, I did not fully get the logic of the part inside reduceByKey. Can you
> kindly explain?
>
>> On 14 Jun 2015 13:58, "Gavin Yue"
I have 10 folders, each with 6000 files. Each folder is roughly 500GB, so
5TB of data in total.
The data is formatted as key \t value. After the union, I want to remove
the duplicates among keys, so each key should be unique and have only one
value.
Here is what I am doing.
folders = Array("folder1"
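A sketch of the union-plus-dedup, following the reduceByKey suggestion quoted above (paths hypothetical, lines assumed to be well-formed key\tvalue):

    val folders = (1 to 10).map(i => s"/data/folder$i")
    val all = folders.map(p => sc.textFile(p)).reduce(_ union _)

    val deduped = all
      .map { line =>
        val Array(k, v) = line.split("\t", 2)   // key \t value
        (k, v)
      }
      .reduceByKey((v1, _) => v1)   // keep exactly one value per key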