Re: Reply: Reply: Reply: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread Ryan
Hi Mo, I don't think it needs a shuffle, because the bloom filter only depends on the data within each row group, not the whole dataset. But the HAR solution seems nice. I've thought of combining small files together and storing the offsets.. I wasn't aware that HDFS provides such functionality. And after some
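A minimal sketch of the no-shuffle point, assuming Spark 2.x with the ORC data source and a DataFrame df of (id, binary) rows (names and paths hypothetical): the bloom filter is built per row group at write time, and sortWithinPartitions sorts only inside each partition, so neither step requires a shuffle.

    // Sketch only: sortWithinPartitions is a local sort (no shuffle) that tightens the
    // per-row-group min/max statistics on "id"; the bloom filter is likewise written
    // per row group. Whether the option is forwarded depends on the Spark/ORC version.
    df.sortWithinPartitions("id")
      .write
      .option("orc.bloom.filter.columns", "id")
      .orc("/data/records_orc")   // hypothetical output path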

Reply: Reply: Reply: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread 莫涛
Hi Ryan, The attachment is the event timeline on the executors. They are always busy computing. More executors would help, but that's not something I control as a developer. 1. The bad performance could be caused by my poor implementation, as "checkID" is a user-defined function and would not be pushed down. 2. To
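A minimal sketch of that pushdown issue, assuming Spark 2.x; checkID, wantedIds, and the paths are hypothetical. A filter expressed through a UDF is opaque to the optimizer, while the equivalent native predicate is a candidate for ORC predicate pushdown.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("pushdown-sketch").getOrCreate()
    val df = spark.read.orc("/data/records_orc")             // hypothetical ORC dataset (id, binary)

    val wantedIds = Seq("id-000001", "id-000002")             // hypothetical id list
    val checkID = udf((id: String) => wantedIds.contains(id))

    val viaUdf    = df.filter(checkID(col("id")))             // UDF: no pushdown, full scan
    val viaNative = df.filter(col("id").isin(wantedIds: _*))  // native predicate: can be pushed down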

Reply: Reply: Reply: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread 莫涛
It's Hadoop Archive. https://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html From: Alonso Isidoro Roman Sent: 2017-04-20 17:03:33 To: 莫涛 Cc: Jörn Franke; user@spark.apache.org Subject: Re: Reply: Reply: How to store 10M records in HDFS to speed up
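For reference, the basic usage from that page (paths hypothetical); files inside the archive are then addressed through the har:// scheme:

    hadoop archive -archiveName records.har -p /data/records /data/archived
    hadoop fs -ls har:///data/archived/records.har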

Re: Reply: Reply: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread Alonso Isidoro Roman
Forgive my ignorance, but what does HAR mean? An acronym for "high available record"? Thanks Alonso Isidoro Roman https://about.me/alonso.isidoro.roman 2017-04-20 10:58

Reply: Reply: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread 莫涛
Hi Jörn, HAR is a great idea! For a POC, I've archived 1M records and stored the id -> path mapping in plain text (for better readability). Filtering 1K records now takes only 2 minutes (30 seconds to get the path list and 0.5 second per thread to read a record). Such performance is exactly what I
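A sketch of what that POC lookup might look like (the paths, file layout, and tab-separated mapping format are assumptions):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.IOUtils
    import org.apache.spark.sql.SparkSession
    import java.io.ByteArrayOutputStream

    val spark = SparkSession.builder().appName("har-poc-sketch").getOrCreate()

    // Step 1 (~30 s): scan the plain-text mapping and keep only the requested ids.
    val requested = spark.sparkContext.textFile("/data/requested_ids.txt").collect().toSet
    val idToPath = spark.sparkContext
      .textFile("/data/id_to_path.txt")                      // lines like "<id>\t<har path>"
      .map(_.split("\t")).map(a => (a(0), a(1)))
      .filter { case (id, _) => requested.contains(id) }
      .collect()

    // Step 2 (~0.5 s per record per thread): read each record through the HAR filesystem.
    val conf = new Configuration()
    val records = idToPath.map { case (id, p) =>
      val path = new Path(p)                                 // e.g. har:///data/archived/records.har/<id>
      val in = path.getFileSystem(conf).open(path)
      val out = new ByteArrayOutputStream()
      IOUtils.copyBytes(in, out, conf, true)                 // closes the streams when done
      (id, out.toByteArray)
    }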

Re: Reply: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Jörn Franke
Yes, 5 MB is a difficult size: too small for HDFS, too big for Parquet/ORC. Maybe you can put the data in a HAR and store (id, path) in ORC/Parquet. > On 17 Apr 2017, at 10:52, 莫涛 wrote: > > Hi Jörn, > > I do think a 5 MB column is odd but I don't have any other idea
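A minimal sketch of the suggested split, assuming the payloads already live in a HAR and only the lightweight (id, path) mapping goes into ORC/Parquet (all names and paths hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("id-path-mapping-sketch").getOrCreate()
    import spark.implicits._

    // Only the small (id, path) pairs are columnar; the 5 MB payloads stay in the HAR.
    val mapping = Seq(
      ("id-000001", "har:///data/archived/records.har/id-000001"),
      ("id-000002", "har:///data/archived/records.har/id-000002")
    ).toDF("id", "path")

    mapping.write.orc("/data/id_path_orc")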

Reply: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread 莫涛
Hi Jörn, I do think a 5 MB column is odd, but I didn't have any better idea before asking this question. The binary data is a short video and the maximum size is no more than 50 MB. Hadoop Archive sounds very interesting and I'll try it first to check whether filtering on it is fast. To my

Re: Reply: Reply: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Ryan
How about the event timeline on the executors? It seems adding more executors could help. 1. I found a JIRA (https://issues.apache.org/jira/browse/SPARK-11621) that states the predicate pushdown should work. And I think "only for matched ones the binary data is read" is true if a proper index is configured. The row group
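One related knob worth checking (a sketch, not necessarily Ryan's exact setting): in Spark 2.x, ORC predicate pushdown is disabled by default and has to be switched on before the reader can use those indexes.

    spark-submit ... --conf spark.sql.orc.filterPushdown=true ...

    // or programmatically, before reading the ORC data:
    spark.conf.set("spark.sql.orc.filterPushdown", "true")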

Reply: Reply: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread 莫涛
Hi Ryan, The attachment is a screenshot of the Spark job, and this is the only stage of this job. I've changed the partition size to 1 GB with "--conf spark.sql.files.maxPartitionBytes=1073741824". 1. spark-orc seems not that smart: the input size is almost the whole data. I guess "only for
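For reference, the programmatic equivalent of that flag (the value is 1 GB in bytes); note it only controls how file splits are grouped into partitions, not whether pushdown happens:

    spark.conf.set("spark.sql.files.maxPartitionBytes", (1024L * 1024 * 1024).toString)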

Re: Reply: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Ryan
1. Per my understanding, for ORC files it should push down the filters, which means the id column will be scanned for all rows but only for matched ones the binary data is read. I haven't dug into the spark-orc reader though.. 2. ORC itself has a row group index and a bloom filter index; you may try configurations
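A sketch of the write-side configuration meant here (option names are from the ORC writer; whether Spark's ORC data source forwards them depends on the Spark/ORC version, so treat this as an assumption to verify):

    df.write
      .option("orc.bloom.filter.columns", "id")   // bloom filter on the lookup column
      .option("orc.bloom.filter.fpp", "0.05")     // acceptable false-positive rate
      .option("orc.row.index.stride", "10000")    // rows covered by each row-group index entry
      .orc("/data/records_orc")                   // hypothetical output path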

Reply: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread 莫涛
Hi Ryan, 1. "expected qps and response time for the filter request" I expect that only the requested BINARY values are scanned instead of all records, so the response time would be "10K * 5MB / disk read speed", or several times that. In practice, our cluster has 30 SAS disks and scanning all
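A rough worked number for that formula, with the per-disk throughput as an assumption: at about 150 MB/s sequential read per SAS disk, the requested data is 10K * 5 MB = ~50 GB; over 30 disks (~4.5 GB/s aggregate, assuming perfectly parallel sequential reads) that is on the order of 10-15 seconds, so even "several times that" stays far below the cost of scanning all 10M records (~50 TB).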