Re: Will be in around 12:30pm due to some personal stuff

2017-01-19 Thread Gavin Yue
PST or EST? > On Jan 19, 2017, at 21:55, ayan guha wrote: > > Sure... we will wait :) :) > > Just kidding > >> On Fri, Jan 20, 2017 at 4:48 PM, Manohar753 >> wrote:

Re: Deep learning libraries for scala

2016-09-30 Thread Gavin Yue
You could try Skymind. It is Java; I have never tested it, though. > On Sep 30, 2016, at 7:30 PM, janardhan shetty wrote: > > Hi, > > Are there any good libraries which can be used for scala deep learning models > ? > How can we integrate tensorflow with scala ML ?

Re: Re[6]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-03 Thread Gavin Yue
Any shuffling? > On Sep 3, 2016, at 5:50 AM, Сергей Романов wrote: > > Same problem happens with CSV data file, so it's not parquet-related either.

Re: EMR for spark job - instance type suggestion

2016-08-26 Thread Gavin Yue
I tried both M4 and R3. R3 is slightly more expensive but has larger memory. If you are doing a lot of in-memory work, like joins, I recommend R3. Otherwise M4 is fine. Also, I remember M4 is an EBS-only instance, so you have to pay the additional EBS cost as well. On Fri, Aug 26, 2016 at 10:29 AM,

How to output RDD to one file with specific name?

2016-08-25 Thread Gavin Yue
I am trying to output an RDD to disk with rdd.coalesce(1).saveAsTextFile("/foo"). It outputs to the foo folder as a file named Part-0. Is there a way I could directly save the file as /foo/somename? Thanks.
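
A minimal sketch of the usual workaround, assuming a SparkContext `sc` and the /foo path from the question: saveAsTextFile cannot choose the file name itself, so write to a temporary directory and then rename the single part file with the Hadoop FileSystem API.

import org.apache.hadoop.fs.{FileSystem, Path}

rdd.coalesce(1).saveAsTextFile("/foo_tmp")

val fs = FileSystem.get(sc.hadoopConfiguration)
fs.mkdirs(new Path("/foo"))
// move the single part file to the desired name, then drop the temp directory
fs.rename(new Path("/foo_tmp/part-00000"), new Path("/foo/somename"))
fs.delete(new Path("/foo_tmp"), true)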

Re: Java Recipes for Spark

2016-07-29 Thread Gavin Yue
This is useful:) Thank you for sharing. > On Jul 29, 2016, at 1:30 PM, Jean Georges Perrin wrote: > > Sorry if this looks like a shameless self promotion, but some of you asked me > to say when I'll have my Java recipes for Apache Spark updated. It's done > here:

Re: Running Spark in Standalone or local modes

2016-06-11 Thread Gavin Yue
> >> On 11 June 2016 at 22:26, Gavin Yue <yue.yuany...@gmail.com> wrote: >> The standalone mode is against Yarn mode or Mesos mode, which means spark >> uses

Re: Running Spark in Standalone or local modes

2016-06-11 Thread Gavin Yue
Standalone mode is contrasted with Yarn mode or Mesos mode, where Spark uses Yarn or Mesos as the cluster manager. Local mode is actually a standalone mode in which everything runs on a single local machine instead of a remote cluster. That is my understanding. On Sat, Jun 11, 2016 at 12:40

Re: HIVE Query 25x faster than SPARK Query

2016-06-10 Thread Gavin Yue
t; > On Thu, Jun 9, 2016 at 11:19 PM, Gavin Yue <yue.yuany...@gmail.com> wrote: > >> Could you print out the sql execution plan? My guess is about broadcast >> join. >> >> >> >> On Jun 9, 2016, at 07:14, Gourav Sengupta <gourav.sengu...@gmail

Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Gavin Yue
Could you print out the SQL execution plan? My guess is that it is about the broadcast join. > On Jun 9, 2016, at 07:14, Gourav Sengupta wrote: > > Hi, > > Query1 is almost 25x faster in HIVE than in SPARK. What is happening here and > is there a way we can optimize the queries
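
A minimal sketch of how to inspect the plan and the broadcast behaviour, assuming a DataFrame `df` holding Query1; the threshold value shown is only an example.

// explain(true) prints the logical and physical plans; a BroadcastHashJoin
// node in the physical plan confirms a broadcast join is being used.
df.explain(true)

// The auto-broadcast threshold (in bytes) can be tuned; -1 disables it.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)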

Re: Behaviour of RDD sampling

2016-05-31 Thread Gavin Yue
Without reading the whole dataset, how do you know the total number of records? And without knowing the total, how do you choose 30%? > On May 31, 2016, at 00:45, pbaier wrote: > > Hi all, > > I have the following use case: > I have around 10k of jsons that I want to
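
For reference, a minimal sketch of fraction-based sampling, assuming an existing RDD `data`: sample() keeps each record with the given probability in a single pass, so it does not need the total record count up front (unlike takeSample, which takes an absolute count and triggers a count internally).

// roughly 30% of the records, without knowing the total beforehand
val sampled = data.sample(withReplacement = false, fraction = 0.3, seed = 42L)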

Re: Is there a way to merge parquet small files?

2016-05-19 Thread Gavin Yue
For log files I would suggest saving as gzipped text files first. After aggregation, convert them into parquet by merging a few files. > On May 19, 2016, at 22:32, Deng Ching-Mallete wrote: > > IMO, it might be better to merge or compact the parquet files instead of >
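
A minimal sketch of the compaction idea, assuming the small parquet files live under a hypothetical /logs/parquet/day=01 path: read them back, reduce the partition count, and rewrite as fewer, larger files.

val small = sqlContext.read.parquet("/logs/parquet/day=01")
// coalesce avoids a full shuffle; 16 output files is an arbitrary example
small.coalesce(16).write.parquet("/logs/parquet-merged/day=01")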

Any NLP lib could be used on spark?

2016-04-19 Thread Gavin Yue
Hey, I want to try NLP on Spark. Could anyone recommend an easy-to-run open-source NLP library for Spark? Also, is there any recommended semantic network? Thanks a lot.

Re: Spark and N-tier architecture

2016-03-29 Thread Gavin Yue
N-tiers or layers are mainly about separating a big problem into smaller pieces, so the idea is always valid; it just means different things for different applications. Speaking of offline analytics, or the big data eco-world, there are numerous ways of slicing the problem into different tiers/layers.

Re: Spark and N-tier architecture

2016-03-29 Thread Gavin Yue
It is a separate project based on my understanding. I am currently evaluating it right now. > On Mar 29, 2016, at 16:17, Michael Segel wrote: > > > >> Begin forwarded message: >> >> From: Michael Segel >> Subject: Re: Spark and N-tier

Re: 回复: a new FileFormat 5x~100x faster than parquet

2016-02-22 Thread Gavin Yue
I recommend you provide more information. Using an inverted index certainly speeds up query time when the index is hit, but it takes longer to create and insert. Is the source code not available at this moment? Thanks Gavin > On Feb 22, 2016, at 20:27, 开心延年 wrote:

Re: Creating HiveContext in Spark-Shell fails

2016-02-15 Thread Gavin Yue
This sqlContext is an instance of HiveContext; do not be confused by the name. > On Feb 16, 2016, at 12:51, Prabhu Joseph wrote: > > Hi All, > > Creating a HiveContext in spark-shell fails with > > Caused by: ERROR XSDB6: Another instance of Derby may
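
A quick way to verify this in a spark-shell built with Hive support (a sketch; the exact class name assumes a 1.x Hive-enabled shell):

// prints org.apache.spark.sql.hive.HiveContext in a Hive-enabled 1.x spark-shell
println(sqlContext.getClass.getName)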

How parquet file decide task number?

2016-02-03 Thread Gavin Yue
I am doing a simple count like: sqlContext.read.parquet("path").count I have only 5000 parquet files. But generate over 2 tasks. Each parquet file is converted from one gz text file. Please give some advice. Thanks

Re: How parquet file decide task number?

2016-02-03 Thread Gavin Yue
Found the answer. It is the block size. Thanks. On Wed, Feb 3, 2016 at 5:05 PM, Gavin Yue <yue.yuany...@gmail.com> wrote: > I am doing a simple count like: > > sqlContext.read.parquet("path").count > > I have only 5000 parquet files. But generate over 200
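
A small sketch for checking this, assuming the same hypothetical "path" as above: the partition count of the resulting DataFrame equals the number of tasks the count launches, and with default settings it tracks total input size divided by the HDFS/parquet block size.

val df = sqlContext.read.parquet("path")
// number of input partitions == number of tasks for the count
println(df.rdd.partitions.length)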

Re: Too many tasks killed the scheduler

2016-01-11 Thread Gavin Yue
<shixi...@databricks.com > wrote: > Could you use "coalesce" to reduce the number of partitions? > > > Shixiong Zhu > > > On Mon, Jan 11, 2016 at 12:21 AM, Gavin Yue <yue.yuany...@gmail.com> > wrote: > >> Here is more info. >> >> T

Re: Too many tasks killed the scheduler

2016-01-11 Thread Gavin Yue
spark.network.timeout was raised from 120s to 600s. It sometimes works. Each task is a parquet file. I could not repartition due to GC problems. Is there any way I could improve the performance? Thanks, Gavin On Sun, Jan 10, 2016 at 1:51 AM, Gavin Yue <yue.yuany...@gmail.com> wrote: > Hey, > >
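
A hedged sketch of the two knobs discussed in this thread, assuming the 10-day union is available as an RDD `all`; the 5000 target below is only an example.

// spark.network.timeout is usually set at submit time, e.g.
//   spark-submit --conf spark.network.timeout=600s ...
// coalesce(shuffle = false) merges partitions locally, so it avoids the shuffle
// (and the GC pressure) that a full repartition would add.
val fewer = all.coalesce(5000, shuffle = false)
println(fewer.partitions.length)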

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-11 Thread Gavin Yue
Has anyone used Ignite in a production system? On Mon, Jan 11, 2016 at 11:44 PM, Jörn Franke wrote: > You can look at ignite as a HDFS cache or for storing rdds. > > > On 11 Jan 2016, at 21:14, Dmitry Goldenberg > wrote: > > > > We have a bunch

parquet repartitions and parquet.enable.summary-metadata does not work

2016-01-10 Thread Gavin Yue
Hey, I am trying to convert a bunch of json files into parquet, which would output over 7000 parquet files. But there are too many files, so I want to repartition based on id down to 3000. But I got a GC error like this one:
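
A hedged sketch of the intended conversion, assuming the json lives under a hypothetical /json/events path and has an "id" column; repartitioning by a column as well as a count needs 1.6+.

val events = sqlContext.read.json("/json/events")
// the summary-metadata switch mentioned in the subject is a Hadoop conf key
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
// 3000 output files; in 1.6+ this could also be repartition(3000, events("id"))
events.repartition(3000).write.parquet("/parquet/events")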

Too many tasks killed the scheduler

2016-01-10 Thread Gavin Yue
Hey, I have 10 days of data; each day is a parquet directory with over 7000 partitions. So when I union the 10 days and do a count, it submits over 70K tasks. Then the job failed silently, with one container exiting with code 1. The union with 5 or 6 days of data is fine. In the spark-shell, it just

Re: How to merge two large table and remove duplicates?

2016-01-09 Thread Gavin Yue
faster. > > LZ4 is faster than GZIP and smaller than Snappy. > > Cheers > > On Fri, Jan 8, 2016 at 7:56 PM, Gavin Yue <yue.yuany...@gmail.com> wrote: > >> Thank you . >> >> And speaking of compression, is there big difference on performance >>

Re: How to merge two large table and remove duplicates?

2016-01-09 Thread Gavin Yue
So I tried to set the parquet compression codec to lzo, but hadoop does not have the lzo natives, while lz4 is included. But I could not set the codec to lz4; it only accepts lzo. Any solution here? Thanks, Gavin On Sat, Jan 9, 2016 at 12:09 AM, Gavin Yue <yue.yuany...@gmail.com> wrote:
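
For reference, a sketch of where the codec is set, assuming a DataFrame `merged` and a hypothetical output path; in the 1.x line the accepted values for this property are uncompressed, snappy, gzip and lzo, which matches the behaviour described above.

sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
merged.write.parquet("/events/merged")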

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Gavin Yue
ta source partitioned by date ? > > Can you dedup within partitions ? > > Cheers > > On Fri, Jan 8, 2016 at 2:10 PM, Gavin Yue <yue.yuany...@gmail.com> wrote: > >> I tried on Three day's data. The total input is only 980GB, but the >> shuffle write Data is

How to merge two large table and remove duplicates?

2016-01-08 Thread Gavin Yue
Hey, I get an Event table every day and want to merge them into a single Event table, but there are so many duplicates among each day's data. I use Parquet as the data source. What I am doing now is EventDay1.unionAll(EventDay2).distinct().write.parquet("a new parquet file"). Each day's Event is
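
A hedged sketch of the merge-and-dedup step, assuming both days are parquet directories with the same schema and that the hypothetical columns userId and eventId identify a duplicate; restricting the comparison with dropDuplicates is usually cheaper than distinct() over every field.

val day1 = sqlContext.read.parquet("/events/day1")
val day2 = sqlContext.read.parquet("/events/day2")
day1.unionAll(day2)
  .dropDuplicates(Seq("userId", "eventId"))
  .write.parquet("/events/merged")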

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Gavin Yue
And the most frequent operation I am going to do is find the UserIDs that have certain events, then retrieve all the events associated with each UserID. In this case, how should I partition to speed up the process? Thanks. On Fri, Jan 8, 2016 at 2:29 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:
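
One hedged option for that access pattern, assuming a merged DataFrame `merged` with a hypothetical userId column: write the table partitioned by user so a per-user lookup only touches one directory. This works when the user-id cardinality is modest; with millions of users a hashed bucket column is the usual variant.

import org.apache.spark.sql.functions.col
merged.write.partitionBy("userId").parquet("/events/by_user")
// a later lookup prunes down to the single matching partition directory
sqlContext.read.parquet("/events/by_user").where(col("userId") === "someUser")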

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Gavin Yue
this process? Thanks. On Fri, Jan 8, 2016 at 2:04 PM, Gavin Yue <yue.yuany...@gmail.com> wrote: > Hey, > > I got everyday's Event table and want to merge them into a single Event > table. But there so many duplicates among each day's data. > > I use Parquet as the data sour

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Gavin Yue
;> load either full data set or a defined number of partitions to see if the >> event has already come (and no gurantee it is full proof, but lead to >> unnecessary loading in most cases). >> >> On Sat, Jan 9, 2016 at 12:56 PM, Gavin Yue <yue.yuany...@gmail.com&

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Gavin Yue
tRuvrm1CGzBJ > > Gavin: > Which release of hbase did you play with ? > > HBase has been evolving and is getting more stable. > > Cheers > > On Fri, Jan 8, 2016 at 6:29 PM, Gavin Yue <yue.yuany...@gmail.com> wrote: > >> I used to maintain a HBase cluster. The expe

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Gavin Yue
ach bucket. Because each bucket is small, spark can > get it done faster than having everything in one run. > - I think using groupBy (userId, timestamp) might be better than > distinct. I guess distinct() will compare every field. > > > On Fri, Jan 8, 2016 at

Re: How to concat few rows into a new column in dataframe

2016-01-05 Thread Gavin Yue
et, v) => set += v, >> (setOne, setTwo) => setOne ++= setTwo) >> >> On Tue, Jan 5, 2016 at 2:46 PM, Gavin Yue <yue.yuany...@gmail.com> wrote: >> >>> Hey, >>> >>> For example, a table df with two columns >>> id name

Re: How to concat few rows into a new column in dataframe

2016-01-05 Thread Gavin Yue
I found that in 1.6 a DataFrame can do repartition. Do I still need to do orderBy first, or can I just repartition? On Tue, Jan 5, 2016 at 9:25 PM, Gavin Yue <yue.yuany...@gmail.com> wrote: > I tried Ted's solution and it works. But I keep hitting the JVM out > of me

How to accelerate reading json file?

2016-01-05 Thread Gavin Yue
I am trying to read json files following the example: val path = "examples/src/main/resources/jsonfile"; val people = sqlContext.read.json(path). I have 1 TB of files in the path, and it took 1.2 hours to finish reading just to infer the schema. But I already know the schema. Could I make this
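
A minimal sketch of skipping the inference pass by supplying the schema up front; the field names and types below are placeholders, and `path` is the same value as above.

import org.apache.spark.sql.types._
val schema = StructType(Seq(
  StructField("userId", StringType),
  StructField("eventType", StringType),
  StructField("timestamp", LongType)
))
// no inference scan over the 1 TB of json
val people = sqlContext.read.schema(schema).json(path)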

How to concat few rows into a new column in dataframe

2016-01-05 Thread Gavin Yue
Hey, for example, a table df with two columns, id and name: (1, abc), (1, bdf), (2, ab), (2, cd). I want to group by the id and concat the strings into an array of strings, like this: 1 -> [abc, bdf], 2 -> [ab, cd]. How could I achieve this with a dataframe? I am stuck at df.groupBy("id"). ??? Thanks
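
Two hedged ways to get id -> [values], using the column names from the example; the first works on a plain SQLContext, the second needs a HiveContext in the 1.x line because collect_list is a Hive UDAF.

// RDD route: group the (id, name) pairs
val grouped = df.select("id", "name").rdd
  .map(r => (r.get(0), r.get(1).toString))
  .groupByKey()
  .mapValues(_.toSeq)

// SQL route (HiveContext):
// df.registerTempTable("t")
// sqlContext.sql("SELECT id, collect_list(name) FROM t GROUP BY id")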

Should I convert json into parquet?

2015-10-17 Thread Gavin Yue
I have json files which contain timestamped events. Each event is associated with a user id. Now I want to group by user id, so it converts from Event1 -> UserIDA; Event2 -> UserIDA; Event3 -> UserIDB; to intermediate storage: UserIDA -> (Event1, Event2, ...); UserIDB -> (Event3, ...). Then I will label

Re: How to properly set conf/spark-env.sh for spark to run on yarn

2015-09-26 Thread Gavin Yue
> > > > > On Saturday, September 26, 2015 10:07 AM, Gavin Yue < > yue.yuany...@gmail.com> wrote: > > > Print out your env variables and check first > > Sent from my iPhone > > On Sep 25, 2015, at 18:43, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID >

Re: executor-cores setting does not work under Yarn

2015-09-25 Thread Gavin Yue
r conf/spark-defaults.conf file? > > Thanks > Best Regards > > On Fri, Sep 25, 2015 at 1:58 AM, Gavin Yue <yue.yuany...@gmail.com> wrote: > >> Running Spark app over Yarn 2.7 >> >> Here is my sparksubmit setting: >> --master yarn-cluster \ >> --num-

Re: How to properly set conf/spark-env.sh for spark to run on yarn

2015-09-25 Thread Gavin Yue
Print out your env variables and check first Sent from my iPhone > On Sep 25, 2015, at 18:43, Zhiliang Zhu wrote: > > Hi All, > > I would like to submit spark job on some another remote machine outside the > cluster, > I also copied hadoop/spark conf files under

executor-cores setting does not work under Yarn

2015-09-24 Thread Gavin Yue
Running a Spark app over Yarn 2.7. Here is my spark-submit setting: --master yarn-cluster \ --num-executors 100 \ --executor-cores 3 \ --executor-memory 20g \ --driver-memory 20g \ --driver-cores 2 \ But the executor-cores setting is not working. It always assigns only one vcore to one

Cache after filter Vs Writing back to HDFS

2015-09-17 Thread Gavin Yue
For a large dataset, I want to filter out some records and then do compute-intensive work. What I am doing now: Data.filter(somerules).cache() Data.count() Data.map(timeintensivecompute) But this sometimes takes an unusually long time due to cache misses and recalculation. So I changed to
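
A hedged variant of the cache route, using the same placeholder names as above: persisting with a disk fallback means an evicted block is re-read from local disk rather than recomputed from the original filter.

import org.apache.spark.storage.StorageLevel
val filtered = Data.filter(somerules).persist(StorageLevel.MEMORY_AND_DISK_SER)
filtered.count()                            // materialize once
filtered.map(timeintensivecompute).count()  // reuses the persisted blocks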

Performance changes quite large

2015-09-17 Thread Gavin Yue
I am trying to parse quite a lot of large json files. At the beginning, I was doing this: textFile(path).map(parseJson(line)).count() For each file (800-900 MB), it would take roughly 1 min to finish. I then changed the code to: val rawData = textFile(path) rawData.cache() rawData.count()

Re: How to increase the Json parsing speed

2015-08-28 Thread Gavin Yue
? On Fri, Aug 28, 2015 at 12:12 PM, Sabarish Sasidharan sabarish.sasidha...@manthan.com wrote: I see that you are not reusing the same mapper instance in the Scala snippet. Regards Sab On Fri, Aug 28, 2015 at 9:38 AM, Gavin Yue yue.yuany...@gmail.com wrote: Just did some tests. I have
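
A sketch of the mapper-reuse point quoted above, assuming an RDD of raw json lines `rawLines`: mapPartitions creates one Jackson ObjectMapper per partition instead of one per record, which is usually the cheap fix for this kind of slowdown.

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
val parsed = rawLines.mapPartitions { lines =>
  val mapper = new ObjectMapper()            // built once per partition
  mapper.registerModule(DefaultScalaModule)
  lines.map(line => mapper.readValue(line, classOf[Map[String, Any]]))
}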

Re: How to increase the Json parsing speed

2015-08-27 Thread Gavin Yue
? Thanks a lot! On Thu, Aug 27, 2015 at 7:45 PM, Sabarish Sasidharan sabarish.sasidha...@manthan.com wrote: For your jsons, can you tell us what is your benchmark when running on a single machine using just plain Java (without Spark and Spark sql)? Regards Sab On 28-Aug-2015 7:29 am, Gavin Yue

How to increase the Json parsing speed

2015-08-27 Thread Gavin Yue
Hey, I am using the Json4s-Jackson parser that comes with Spark, parsing roughly 80M records with a total size of 900 MB. But the speed is slow: it took my 50 nodes (16-core CPU, 100 GB mem each) roughly 30 mins to parse the Json for use with Spark SQL. Jackson's own benchmarks say parsing should be at the ms level.

Re: Abount Jobs UI in yarn-client mode

2015-06-20 Thread Gavin Yue
I got the same problem when I upgraded from 1.3.1 to 1.4. The same conf was used; 1.3 works, but the 1.4 UI does not. So I added the properties <property> <name>yarn.resourcemanager.webapp.address</name> <value>:8088</value> </property> <property> <name>yarn.resourcemanager.hostname</name>

How could output the StreamingLinearRegressionWithSGD prediction result?

2015-06-20 Thread Gavin Yue
Hey, I am testing StreamingLinearRegressionWithSGD following the tutorial. It works, but I could not output the prediction results. I tried saveAsTextFile, but it only outputs _SUCCESS to the folder. I am trying to check the prediction results and use BinaryClassificationMetrics to get
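
A hedged sketch based on the streaming regression tutorial, assuming `model` is the StreamingLinearRegressionWithSGD model and `testData` is the labeled test DStream: predictOnValues keeps the label next to the prediction, and print()/saveAsTextFiles emit output per batch (a batch with no data writes only _SUCCESS, which may be what was observed).

val predictions = model.predictOnValues(testData.map(lp => (lp.label, lp.features)))
predictions.print()                                   // sample output in the driver log
predictions.saveAsTextFiles("/output/predictions")    // hypothetical path prefix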

Re: What is most efficient to do a large union and remove duplicates?

2015-06-14 Thread Gavin Yue
On Jun 14, 2015, at 02:10, ayan guha guha.a...@gmail.com wrote: Can you do dedupe process locally for each file first and then globally? Also I did not fully get the logic of the part inside reducebykey. Can you kindly explain? On 14 Jun 2015 13:58, Gavin Yue yue.yuany...@gmail.com wrote

What is most efficient to do a large union and remove duplicates?

2015-06-13 Thread Gavin Yue
I have 10 folders, each with 6000 files. Each folder is roughly 500GB, so 5TB of data in total. The data is formatted as key \t value. After the union, I want to remove the duplicates among keys, so each key should be unique and have only one value. Here is what I am doing. folders =
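
A hedged sketch of the union-and-dedup on the tab-separated data, with hypothetical folder paths; keeping one value per key with reduceByKey avoids a full distinct() over whole records.

val folders = (1 to 10).map(i => s"/data/folder$i")
val all = sc.textFile(folders.mkString(","))   // textFile accepts a comma-separated path list
val deduped = all
  .map { line => val Array(k, v) = line.split("\t", 2); (k, v) }   // assumes every line has a tab
  .reduceByKey((v1, _) => v1)                  // keep one value per key
deduped.map { case (k, v) => s"$k\t$v" }.saveAsTextFile("/data/merged")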
