m/about-aws/whats-new/2017/12/amazon-redshift-introduces-late-materialization-for-faster-query-processing/)
>
> Thanks..
>
>
> On Mar 18, 2018 8:09 PM, "nguyen duc Tuan" wrote:
>
Hi @CPC,
Parquet is a columnar storage format, so if you want to read data from only one
column, you can do that without accessing all of your data. Spark SQL also
includes a query optimizer (see
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html),
so it will optimize such reads for you.
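As a small sketch of the column-pruning part (the file path and column name here are hypothetical):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-pruning").getOrCreate()
val df = spark.read.parquet("/data/events.parquet") // hypothetical file
// Only the "price" column chunks are read from the Parquet files;
// the other columns are never loaded.
df.select("price").show()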
In general, an RDD, which is the central concept of Spark, is just a
definition of how to get and process data. Each partition of an RDD
defines how to get/process one partition of the data. A series of
transformations transforms every partition of data from the previous RDD. I'll
give you a very easy example below.
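A tiny sketch of that idea (assuming sc is the SparkContext):
val rdd = sc.parallelize(1 to 100, numSlices = 4) // 4 partitions; each one defines how to get its slice of the data
val doubled = rdd.map(_ * 2)                      // the transformation is applied partition by partition
doubled.collect()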
Hi Chapman,
You can use "cluster by" to do what you want.
https://deepsense.io/optimize-spark-with-distribute-by-and-cluster-by/
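A minimal sketch of the idea (the view and column names are hypothetical):
// Register the DataFrame as a temp view, then cluster by the key column.
df.createOrReplaceTempView("events")
// CLUSTER BY user_id = DISTRIBUTE BY user_id + SORT BY user_id within each partition.
val clustered = spark.sql("SELECT * FROM events CLUSTER BY user_id")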
2017-06-24 17:48 GMT+07:00 Saliya Ekanayake :
> I haven't worked with datasets but would this help
> https://stackoverflow.com/questions/37513667/how-to-create-a-spark
e it supports timestamp-based index.
2017-03-28 5:24 GMT+07:00 Soumitra Johri :
> Hi, did you guys figure it out?
>
> Thanks
> Soumitra
>
> On Sun, Mar 5, 2017 at 9:51 PM nguyen duc Tuan
> wrote:
>
Hi everyone,
We are deploying a Kafka cluster for ingesting streaming data. But sometimes
some nodes in the cluster run into trouble (a node dies, the Kafka daemon is
killed, ...), and recovering data in Kafka can be very slow: it takes
several hours to recover from a disaster. I saw a slide here sugges
nLSH is not yet in Spark master (see
>> https://issues.apache.org/jira/browse/SPARK-18082). It could be there
>> are some implementation details that would make a difference here.
>>
>> By the way, what is the join threshold you use in approx join?
>>
>> Could you perhaps create a JIRA ticket with the details in order to t
In the end, I switched back to the LSH implementation that I used before
(https://github.com/karlhigley/spark-neighbors). I can run it on my dataset
now. If someone has any suggestions, please tell me.
Thanks.
2017-02-12 9:25 GMT+07:00 nguyen duc Tuan :
> Hi Timur,
> 1) Our data is transfor
GMT+07:00 Nick Pentreath :
> What other params are you using for the lsh transformer?
>
> Are the issues occurring during transform or during the similarity join?
>
>
> On Fri, 10 Feb 2017 at 05:46, nguyen duc Tuan
> wrote:
>
>> hi Das,
>> In general, I will
Debasish Das :
> If it is 7m rows and 700k features (or say 1m features) brute force row
> similarity will run fine as well...check out spark-4823...you can compare
> quality with approximate variant...
> On Feb 9, 2017 2:55 AM, "nguyen duc Tuan" wrote:
>
Hi everyone,
Since spark 2.1.0 introduces LSH (
http://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing),
we want to use LSH to find approximate nearest neighbors. Basically, we
have a dataset with about 7M rows, and we want to use cosine distance to measure
the similarity between them.
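For reference, a minimal sketch with Spark ML's built-in LSH (note that Spark 2.1 only ships BucketedRandomProjectionLSH for Euclidean distance and MinHashLSH for Jaccard, not a cosine variant; the column names and threshold here are hypothetical):
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH

val lsh = new BucketedRandomProjectionLSH()
  .setInputCol("features")
  .setOutputCol("hashes")
  .setNumHashTables(3)
  .setBucketLength(2.0)
val model = lsh.fit(df)
// Self-join the dataset to find pairs whose distance is below the threshold.
val neighbors = model.approxSimilarityJoin(df, df, 1.5)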
The simplest solution that I found: use a browser extension which does
that for you :D. For example, if you are using Chrome, you can use this
extension:
https://chrome.google.com/webstore/detail/easy-auto-refresh/aabcgdmkeabbnleenpncegpcngjpnjkc/related?hl=en
Another way, but a bit more manual
Hi jeremycod,
If you want to find the top N nearest neighbors for all users using an exact
top-k algorithm, I recommend using the same approach as used in
MLlib:
https://github.com/apache/spark/blob/85d6b0db9f5bd425c36482ffcb1c3b9fd0fcdb31/mllib/src/main/scala/org/apache/spark/mllib/rec
About the Spark issue that you referred to, it is not related to your problem :D
In this case, all you have to do is use the withColumn function. For example:
import org.apache.spark.sql.functions._
val getRange = udf((x: Int) => /* get price range code ... */ )
val priceRange = resultPrice.withColumn("range", getRange(col("price"))) // assuming the price column is named "price"
Stream 'result' is empty or not
> and based on the result, perform some operation on input1Pair DStream and
> input2Pair RDD.
>
>
> On Mon, Jun 20, 2016 at 7:05 PM, nguyen duc tuan
> wrote:
>
Hi Praseetha,
In order to check if a DStream is empty or not, using the isEmpty method is
correct. I think the problem here is calling
input1Pair.leftOuterJoin(input2Pair).
I guess the input1Pair RDD comes from the transformation above. You should do it
on the DStream instead. In this case, do any transformation with
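A minimal sketch of that idea (assuming input1Pair is a DStream[(K, V)] and input2Pair is an RDD[(K, W)]):
// Do the join per batch on the DStream instead of on a captured RDD.
val joined = input1Pair.transform { rdd => rdd.leftOuterJoin(input2Pair) }
joined.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // perform the follow-up operations on this batch here
  }
}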
You should set both PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON to the path of
your Python interpreter.
2016-06-02 20:32 GMT+07:00 Bhupendra Mishra :
> did not resolved. :(
>
> On Thu, Jun 2, 2016 at 3:01 PM, Sergio Fernández
> wrote:
>
>>
>> On Thu, Jun 2, 2016 at 9:59 AM, Bhupendra Mishra <
>> bh
Spark will load the whole dataset.
The sampling action can be viewed as a filter. The real implementation may be
more complicated, but this simple version gives you the idea:
import scala.util.Random

val rand = new Random()
val subRdd = rdd.filter(x => rand.nextDouble() <= 0.3)
To prevent recomputing data, you can cache the sampled RDD.
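A small sketch continuing the example above:
// Cache the sampled RDD so that the random filter is evaluated only once
// and later actions reuse the same sample instead of recomputing it.
subRdd.cache()
subRdd.count() // the first action materializes the cache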
.foreachRDD(process_rdd) <<< As you can see here, no variable to
> store the output from foreachRDD. My target is to get (pred, text) pair and
> then use
>
> Whatever it is, the output from "extract_feature" is not what I want.
> I will be more than happy if you please co
t to
> another place/context, like db,file etc.
> I could use that but seems like over head of having another hop.
> I wanted to make it simple and light.
>
>
> On Tuesday, 31 May 2016, nguyen duc tuan wrote:
>
>> How about using foreachRDD ? I think this is much bett
# Return will be a tuple of (score, 'original text')
> return predictions
>
>
> Hope, it will help somebody who is facing same problem.
> If anybody has better idea, please post it here.
>
> -Obaid
>
> On Mon, May 30, 2016 at 8:43 PM, nguyen duc tuan
&g
How about this?
def extract_feature(rf_model, x):
    text = getFeatures(x).split(',')
    fea = [float(i) for i in text]
    prediction = rf_model.predict(fea)
    return (prediction, x)
output = texts.map(lambda x: extract_feature(rf_model, x))
2016-05-30 14:17 GMT+07:00 obaidul karim :
> Hi,
>
> Anybody has
Take a look at this:
http://stackoverflow.com/questions/33424445/is-there-a-way-to-checkpoint-apache-spark-dataframes
So all you have to do to create a checkpoint for a dataframe is the following:
df.rdd.checkpoint()
df.rdd.count() // or any action
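A slightly fuller sketch (assuming sc is the SparkContext; the checkpoint directory is hypothetical):
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
val rdd = df.rdd // DataFrame.rdd is a lazy val, so this is the same RDD on every access
rdd.checkpoint()
rdd.count()      // an action is needed to actually write the checkpoint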
2016-05-25 8:43 GMT+07:00 naliazheli <754405...@qq.com>:
>
There's no rowSimilarities method in the RowMatrix class. You have to transpose
your matrix and use columnSimilarities instead. However, when the number of rows
is large, this approach is still very slow.
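A rough sketch of the transpose route (assuming the matrix is available as an RDD[MatrixEntry] named entries; the threshold is a hypothetical value):
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix

val coord = new CoordinateMatrix(entries)  // entries: RDD[MatrixEntry] of the original matrix
val rowSims = coord.transpose().toRowMatrix()
  .columnSimilarities(0.1)                 // column similarities of the transpose = row similarities of the original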
Try to use approximate nearest neighbor (ANN) methods instead, such as LSH.
There are several implementations of LSH on
Try to use DataFrames instead of RDDs.
Here's an introduction to DataFrames:
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
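A tiny sketch of the difference in style (assuming a SQLContext named sqlContext and a hypothetical JSON file):
val df = sqlContext.read.json("people.json")
df.groupBy("age").count().show() // declarative operations let the Catalyst optimizer plan the execution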
2016-05-06 21:52 GMT+07:00 pratik gawande :
> Thanks Shao for quick reply. I will look into how pyspark jobs are
> exe
What does the WebUI show? What do you see when you click on the "stderr" and
"stdout" links? These links should contain the stdout and stderr output for each
executor.
About your custom logging in the executors, are you sure you checked
"${spark.yarn.app.container.log.dir}/spark-app.log"?
The actual location of this fil
These are the executors' logs, not the driver logs. To see these log files, you
have to go to the executor machines where the tasks are running. To see what you
print to stdout or stderr, you can either go to the executor machines
directly (it will be stored in "stdout" and "stderr" files somewhere in the
executor
oint ids would).
>
> Here's an implementation of that idea in the context of finding nearest
> neighbors:
>
> https://github.com/karlhigley/spark-neighbors/blob/master/src/main/scala/com/github/karlhigley/spark/neighbors/ANNModel.scala#L33-L34
>
> Best,
> Karl
>
Hi all,
Currently, I'm working on implementing LSH on Spark, which leads to the
following problem. I have an RDD[(Int, Int)] that stores all pairs of ids of
vectors whose distance needs to be computed, and another RDD[(Int, Vector)] that
stores all vectors with their ids. Can anyone suggest an efficient way to compute
these distances?
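To make the setup concrete, here is a sketch of the straightforward join-based approach (assuming pairs: RDD[(Int, Int)] and vectors: RDD[(Int, Vector)]):
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Attach the vector of the first id, re-key by the second id, then attach the second vector.
val withFirst = pairs.join(vectors)     // (idA, (idB, vecA))
  .map { case (idA, (idB, vecA)) => (idB, (idA, vecA)) }
val distances = withFirst.join(vectors) // (idB, ((idA, vecA), vecB))
  .map { case (idB, ((idA, vecA), vecB)) => ((idA, idB), math.sqrt(Vectors.sqdist(vecA, vecB))) }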
Maybe the problem is the data itself. For example, the first dataframe
might have keys in common with only one part of the second dataframe. I think you
can verify whether you are in this situation by repartitioning one dataframe and
joining it. If this is the real reason, you might see the result distributed
more evenly.
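A minimal sketch of that check (the join key name and partition count are hypothetical):
import org.apache.spark.sql.functions.col

val joined = df1.repartition(200, col("key")).join(df2, "key")
joined.count() // then inspect the task/partition sizes in the web UI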
That's very useful information.
The reason for the weird problem is the non-determinism of the RDD
before applying randomSplit.
By caching the RDD, we make it deterministic, and so the problem is
solved.
Thank you for your help.
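For completeness, a minimal sketch of the fix (the split weights and seed are arbitrary):
rdd.cache() // make the RDD's contents stable across re-evaluations
val Array(train, test) = rdd.randomSplit(Array(0.8, 0.2), seed = 42L)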
2016-02-21 11:12 GMT+07:00 Ted Yu :
> Have you looked at:
>