Spark SQL reads all leaf directories on a partitioned Hive table

2019-08-07 Thread Hao Ren
ing `spark.read.parquet` API to read parquet files directly. Spark has partition-awareness for partitioned directories. But still, I would like to know if there is a way to leverage partition-awareness via Hive by using `spark.sql` API? Any help is highly appreciated! Thank you. -- Hao Ren
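The partition-awareness mentioned above relies on Hive-style `key=value` directory names under the table root. As an illustrative sketch (plain Scala, not Spark's actual implementation), this is the naming convention partition discovery decodes:

```scala
// Parse Hive-style partition directories, e.g.
// "warehouse/events/date=2019-08-07/country=FR/part-00000.parquet",
// into a map of partition column -> value. Illustrative sketch only;
// Spark's real partition discovery also infers types and handles escaping.
def partitionValues(path: String): Map[String, String] =
  path.split('/').iterator
    .filter(_.contains('='))          // only key=value segments are partitions
    .map { seg =>
      val Array(k, v) = seg.split("=", 2)
      k -> v
    }
    .toMap

val parts = partitionValues("warehouse/events/date=2019-08-07/country=FR/part-00000.parquet")
// parts: Map("date" -> "2019-08-07", "country" -> "FR")
```

Because the values live in the path, a reader can prune whole directories for a predicate like `date = '2019-08-07'` without opening any files.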

SparkSQL can not extract values from UDT (like VectorUDT)

2015-10-12 Thread Hao Ren
rk/sql/catalyst/expressions/complexTypeExtractors.scala#L49 It seems that the pattern matching does not take UDT into consideration. Is this an intended feature? If not, I would like to create a PR to fix it. -- Hao Ren Data Engineer @ leboncoin Paris, France
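The issue described here is a pattern match that does not unwrap user-defined types. A minimal analogue in plain Scala (hypothetical type names, not Spark's code) shows both the fall-through and the proposed fix:

```scala
// Illustrative sketch of a resolver that pattern-matches on data types,
// analogous to complexTypeExtractors. A UDT wrapping a supported sqlType
// falls through the match unless it is explicitly unwrapped.
sealed trait DataTypeLike
case object IntTypeLike extends DataTypeLike
case class StructTypeLike(fields: Map[String, DataTypeLike]) extends DataTypeLike
// A "UDT" backed by an underlying SQL type, like VectorUDT.
case class UdtLike(sqlType: DataTypeLike) extends DataTypeLike

def canExtractField(dt: DataTypeLike, field: String): Boolean = dt match {
  case StructTypeLike(fs)  => fs.contains(field)
  case UdtLike(underlying) => canExtractField(underlying, field) // the fix: recurse into the UDT
  case _                   => false                              // without that case, UDTs land here
}
```

Without the `UdtLike` case, every UDT would hit the wildcard and extraction would fail, which is the behavior the thread reports.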

implicit ClassTag in KafkaUtils


2015-12-17 Thread Hao Ren
cordClass) val cleanedHandler = jssc.sparkContext.clean(messageHandler.call _) createDirectStream[K, V, KD, VD, R]( jssc.ssc, Map(kafkaParams.toSeq: _*), Map(fromOffsets.mapValues { _.longValue() }.toSeq: _*), cleanedHandler ) } -- Hao Ren Data Engineer @ leboncoin Paris, France
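The `ClassTag`s that `createDirectStream[K, V, KD, VD, R]` requires exist because of type erasure: the runtime type of a generic parameter is gone unless a tag carries it. A small standalone sketch (not the Kafka code) of why a context-bound `ClassTag` is needed:

```scala
import scala.reflect.ClassTag

// Erasure removes the runtime type of T; a ClassTag restores it, which is
// required e.g. to construct a typed Array[T]. Illustrative sketch only.
def typedArray[T: ClassTag](xs: T*): Array[T] = Array(xs: _*)

val ints  = typedArray(1, 2, 3)       // Array[Int]
val names = typedArray("a", "b")      // Array[String]
val tag   = implicitly[ClassTag[Int]] // the tag the context bound resolves
```

Without the `T: ClassTag` bound, `Array(xs: _*)` would not compile, for the same reason `createDirectStream` cannot build typed RDD machinery without tags for `K`, `V`, `KD`, `VD`, and `R`.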

Re: implicit ClassTag in KafkaUtils

2015-12-17 Thread Hao Ren
ich is implied as context bound, Java does not have the > equivalence, so here change the java class to the ClassTag, and make it as > implicit value, it will be used by createDirectStream. > > > Thanks > Saisai > > > On Thu, Dec 17, 2015 at 9:49 PM, Hao Ren wrote: >
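The pattern the reply describes, converting a `java.lang.Class` received from a Java-friendly API into an implicit `ClassTag` for the Scala API, can be sketched as follows (method and parameter names are hypothetical):

```scala
import scala.reflect.ClassTag

// Sketch of the Java-to-Scala bridge pattern: the Java overload takes a
// Class object (Java has no context bounds), wraps it in a ClassTag, and
// makes it implicit so the ClassTag-bounded Scala method can resolve it.
def fromJavaApi[K](recordClass: Class[K]): ClassTag[K] = {
  implicit val kt: ClassTag[K] = ClassTag(recordClass)
  kt // in the real bridge, the Scala method with a [K: ClassTag] bound is called here
}

val stringTag = fromJavaApi(classOf[String])
```

This is why the Java `createDirectStream` signatures accept `Class[K]`, `Class[V]`, etc., while the Scala signatures use context bounds.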

[MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-01 Thread Hao Ren
? -- Hao Ren Data Engineer @ leboncoin Paris, France
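For reference on the definitions being debated, here is a plain-Scala sketch of two common "term frequency" conventions; MLlib's `HashingTF` instead hashes terms into a fixed-size vector of raw counts, which this sketch does not reproduce:

```scala
// Raw term frequency: the count of each term in one document.
def termFrequency(doc: Seq[String]): Map[String, Int] =
  doc.groupBy(identity).map { case (t, occ) => t -> occ.size }

// Length-normalized variant (count / document length), another common
// TF definition; which one a library uses changes downstream TF-IDF values.
def normalizedTf(doc: Seq[String]): Map[String, Double] = {
  val n = doc.size.toDouble
  termFrequency(doc).map { case (t, c) => t -> c / n }
}
```

Discrepancies like the one this thread raises usually come down to which of these (or a boolean/log-scaled variant) an implementation silently picked.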

[SPARK-2.0][SQL] UDF containing non-serializable object does not work as expected

2016-08-07 Thread Hao Ren
"key" === 2).show() // *It does not work as expected (org.apache.spark.SparkException: Task not serializable)* } run() } Also, I tried collect(), count(), first(), limit(). All of them worked without non-serializable exceptions. It seems only filter() throws the exception
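The failure mode can be reproduced outside Spark, since Spark's task serializer ultimately uses Java serialization. A minimal sketch: serializing a function object whose closure captures a non-serializable field fails exactly like `Task not serializable`:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// No Serializable marker, like the object captured by the UDF in this thread.
class NonSerializable { def offset: Int = 10 }

// A function whose "closure" holds a NonSerializable field.
class CapturingFn(ns: NonSerializable) extends (Int => Int) with Serializable {
  def apply(x: Int): Int = x + ns.offset
}

// Mimics what Spark does before shipping a task to executors.
def trySerialize(obj: AnyRef): Boolean =
  try {
    val oos = new ObjectOutputStream(new ByteArrayOutputStream())
    oos.writeObject(obj); oos.close(); true
  } catch { case _: NotSerializableException => false }

val ok = trySerialize(new CapturingFn(new NonSerializable)) // false: capture is not serializable
```

Why only `filter()` triggers it in the report is a separate question (likely which operations actually ship the UDF closure to executors), but the serialization failure itself is this generic mechanism.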

Re: [SPARK-2.0][SQL] UDF containing non-serializable object does not work as expected

2016-08-08 Thread Hao Ren
Yes, it is. You can define a udf like that. Basically, it's a udf Int => Int whose closure contains a non-serializable object. The latter should cause a Task not serializable exception. Hao On Mon, Aug 8, 2016 at 5:08 AM, Muthu Jayakumar wrote: > >
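A common workaround for the situation described here is to construct the non-serializable object inside the function body rather than capturing it, so the serialized function carries no offending field. A hedged plain-Scala sketch (class names hypothetical; `transient lazy val` fields are another standard fix):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

class Heavy { def offset: Int = 10 } // not Serializable

// Instead of capturing a Heavy instance, build it inside apply(): the
// function object itself then has no non-serializable state to serialize.
class SafeFn extends (Int => Int) with Serializable {
  def apply(x: Int): Int = new Heavy().offset + x
}

def serializes(obj: AnyRef): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj); true
  } catch { case _: NotSerializableException => false }
```

The trade-off is that the object is re-created per invocation (or per partition, if hoisted into `mapPartitions`-style code), which is usually acceptable when the alternative is a hard serialization failure.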

Re: [SPARK-2.0][SQL] UDF containing non-serializable object does not work as expected

2016-08-08 Thread Hao Ren
Ints are serializable? >> Just thinking out loud >> Simon Scott >> Research Developer @ viavisolutions.com >> *From:* Hao Ren [mailto:inv...@gmail.com] >> *Sent:

S3 Read / Write makes executors deadlocked

2015-07-16 Thread Hao Ren
.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) -- Hao Ren Data Engineer @ leboncoin Paris, France
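The stack trace ends inside `ThreadPoolExecutor`, which is consistent with a pool-starvation deadlock. Whether or not that is the exact S3 cause here, the classic pattern is easy to reproduce: a task on a bounded pool blocks waiting on a second task submitted to the same pool, which can never start. A self-contained sketch:

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit, TimeoutException}

// Hypothetical minimal deadlock, not the S3 code from this thread:
// with a single worker thread, the outer task occupies the pool while
// blocking on an inner task that is stuck in the queue forever.
val pool = Executors.newFixedThreadPool(1)

val outer = pool.submit(new Callable[Int] {
  def call(): Int = {
    val inner = pool.submit(new Callable[Int] { def call(): Int = 42 })
    inner.get() // blocks forever: the only worker is running *this* task
  }
})

val deadlocked =
  try { outer.get(500, TimeUnit.MILLISECONDS); false }
  catch { case _: TimeoutException => true } // times out: classic starvation

pool.shutdownNow() // interrupt the stuck worker so the JVM can exit
```

In Spark terms, the analogous failure is executor threads all blocked on I/O completions that can only be serviced by those same threads, so the job hangs rather than failing.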

Re: S3 Read / Write makes executors deadlocked

2015-07-16 Thread Hao Ren
mmon use case. Any help on this issue is highly appreciated. If you need more info, check out the JIRA I created: https://issues.apache.org/jira/browse/SPARK-8869 On Thu, Jul 16, 2015 at 11:39 AM, Hao Ren wrote: > Given the following code which just reads from s3, then saves files to s3

[MLlib] BinaryLogisticRegressionSummary on test set

2015-09-17 Thread Hao Ren
to summarize any data set we want. If there is a way to summarize a test set, please let me know. I have browsed LogisticRegression.scala, but failed to find one. Thx. -- Hao Ren Data Engineer @ leboncoin Paris, France
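What the thread asks for, evaluating on an arbitrary held-out set rather than only the training summary, can always be done by hand from `(prediction, label)` pairs. A plain-Scala sketch of two binary metrics (function names illustrative, not MLlib's API):

```scala
// scored: (prediction, label) pairs with values 0.0 or 1.0,
// e.g. collected from model.transform(testSet) in Spark.
def accuracy(scored: Seq[(Double, Double)]): Double =
  scored.count { case (p, l) => p == l }.toDouble / scored.size

def precision(scored: Seq[(Double, Double)]): Double = {
  val predictedPos = scored.filter { case (p, _) => p == 1.0 }
  predictedPos.count { case (_, l) => l == 1.0 }.toDouble / predictedPos.size
}

val testScored = Seq((1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (0.0, 1.0))
// accuracy(testScored) == 0.5, precision(testScored) == 0.5
```

This is only a stopgap for the missing test-set summary the thread discusses; the reply below suggests the proper route of filing a JIRA for first-class support.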

Re: [MLlib] BinaryLogisticRegressionSummary on test set

2015-09-18 Thread Hao Ren
>); > perhaps you could push for this to happen by creating a Jira and pinging > jkbradley and mengxr. Thanks! > > On Thu, Sep 17, 2015 at 8:07 AM, Hao Ren wrote: > >> Working on spark.ml.classification.LogisticRegression.scala (spark 1.5), >> >> It might be useful