Re: Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread
ail.com> Sent: April 17, 2017 16:48:47 To: 莫涛 Cc: user Subject: Re: Re: Re: How to store 10M records in HDFS to speed up further filtering? How about the event timeline on the executors? It seems adding more executors could help. 1. I found a JIRA (https://issues.apache.org/jira/browse/SPARK-11621) that state

Re: Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread
It's a Hadoop archive. https://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html From: Alonso Isidoro Roman <alons...@gmail.com> Sent: April 20, 2017 17:03:33 To: 莫涛 Cc: Jörn Franke; user@spark.apache.org Subject: Re: Re: Re: How to store 10M records in HDFS to sp

Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread
t I expected: "only the requested BINARY are scanned". Moreover, HAR provides direct access to each record via hdfs shell commands. Thank you very much! From: Jörn Franke <jornfra...@gmail.com> Sent: April 17, 2017 22:37:48 To: 莫涛 Cc: user@spark.ap
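As a minimal sketch of the HAR workflow the thread settled on (paths and file names here are hypothetical, following the Hadoop Archives guide linked above):

```shell
# Pack the small record files under /data/records into one archive.
# Syntax: hadoop archive -archiveName <name>.har -p <parent> <src> <dest>
hadoop archive -archiveName records.har -p /data records /archives

# The archive is addressable through the har:// URI scheme, so individual
# records remain directly accessible from the hdfs shell:
hdfs dfs -ls -R har:///archives/records.har
hdfs dfs -cat har:///archives/records.har/records/part-0
```

This avoids the NameNode pressure of 10M small files while still allowing per-record access, which is the property the original poster confirmed.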

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread
best knowledge, HBase works best for records of around hundreds of KB, and it requires extra work from the cluster administrator. So this would be the last option. Thanks! Mo Tao From: Jörn Franke <jornfra...@gmail.com> Sent: April 17, 2017 15:59:28 To: 莫涛 Cc

Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread
f the given ID list. No partition could be skipped in the worst case. Mo Tao From: Ryan <ryan.hd@gmail.com> Sent: April 17, 2017 15:42:46 To: 莫涛 Cc: user Subject: Re: Re: How to store 10M records in HDFS to speed up further filtering? 1. Per my understanding
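The worst case described here can be sketched in plain Python (names are illustrative, not from the original mails): partition pruning relies on per-partition min/max statistics, so an ID list scattered across the whole key space leaves no partition skippable.

```python
# Sketch: min/max partition pruning against a requested ID list.
# A partition can be skipped only if no wanted ID falls inside its range.
def prunable_partitions(partition_stats, wanted_ids):
    """Return indices of partitions that can be skipped for this ID list."""
    skipped = []
    for i, (lo, hi) in enumerate(partition_stats):
        if not any(lo <= w <= hi for w in wanted_ids):
            skipped.append(i)
    return skipped

# Four partitions covering contiguous ID ranges.
stats = [(0, 249), (250, 499), (500, 749), (750, 999)]

# A clustered ID list lets most partitions be skipped...
print(prunable_partitions(stats, [10, 20, 30]))        # -> [1, 2, 3]

# ...but IDs scattered across the key space skip nothing: the worst case.
print(prunable_partitions(stats, [10, 400, 600, 900]))  # -> []
```

This is why sorting or bucketing the data by ID matters: it makes requested IDs cluster into few partitions instead of touching all of them.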

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread
ng I'm looking for! Could you kindly provide some links for reference? I found nothing in the Spark documentation about indexes or bloom filters working inside a partition. Thanks very much! Mo Tao From: Ryan <ryan.hd@gmail.com> Sent: April 17, 2017 14:32:00 To: 莫涛 Cc: u
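To illustrate the per-partition bloom-filter idea raised here, a minimal pure-Python sketch (illustrative only; formats such as ORC and Parquet ship production implementations): a small filter stored with each partition lets a reader reject most partitions that cannot contain a given ID without scanning their records. False positives are possible; false negatives are not.

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: k hash positions over a fixed-size bit array."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # the bit array, kept as a Python int

    def _positions(self, item):
        # Derive num_hashes positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= (1 << pos)

    def might_contain(self, item):
        # True means "maybe present"; False means "definitely absent".
        return all(self.bits & (1 << pos) for pos in self._positions(item))

# One filter per partition, built over that partition's record IDs.
bf = BloomFilter()
for record_id in ["id-001", "id-002", "id-003"]:
    bf.add(record_id)

print(bf.might_contain("id-002"))   # True: the record was added
print(bf.might_contain("id-999"))   # almost certainly False: skip this partition
```

A `False` answer lets the reader skip the partition outright, which is exactly the "only the requested BINARY are scanned" behaviour the poster was after.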

Re: Re: how to generate a column using mapPartition and then add it back to the df?

2016-08-08 Thread
Hi guha, Thanks a lot! This is exactly what I want and I'll try to implement it. MoTao From: ayan guha <guha.a...@gmail.com> Sent: August 8, 2016 18:05:37 To: 莫涛 Cc: ndj...@gmail.com; user@spark.apache.org Subject: Re: Re: how to generate a column using mapPa

Re: how to generate a column using mapPartition and then add it back to the df?

2016-08-08 Thread
as possible as I can. Best From: ndj...@gmail.com <ndj...@gmail.com> Sent: August 8, 2016 17:16:27 To: 莫涛 Cc: user@spark.apache.org Subject: Re: how to generate a column using mapPartition and then add it back to the df? Hi MoTao, What about broadcasting the model? Cheers,
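The point of the broadcast-plus-mapPartitions suggestion can be sketched in plain Python standing in for Spark (function and variable names here are hypothetical): the model is materialized once per partition, not once per record.

```python
# Counts how many times the "broadcast" model is deserialized.
load_count = 0

def load_model():
    # Stands in for unpickling a broadcast model on an executor.
    global load_count
    load_count += 1
    return lambda x: x * 2  # trivial stand-in model

def score_partition(records):
    model = load_model()      # per-partition setup: happens once
    for r in records:
        yield model(r)        # every record in the partition reuses it

partitions = [[1, 2, 3], [4, 5, 6]]
result = [list(score_partition(p)) for p in partitions]
print(result)       # [[2, 4, 6], [8, 10, 12]]
print(load_count)   # 2 -- one load per partition, not one per record
```

In actual Spark the same shape is `rdd.mapPartitions(score_partition)` with the model wrapped in `sc.broadcast(...)`, so each executor holds a single read-only copy instead of shipping the model with every task closure.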