Forgive my ignorance, but what does HAR mean? Is it an acronym for "high availability record"?
Thanks

Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman

2017-04-20 10:58 GMT+02:00 莫涛 <mo...@sensetime.com>:

> Hi Jörn,
>
> HAR is a great idea!
>
> For the POC, I've archived 1M records and stored the id -> path mapping in
> text (for better readability).
> Filtering 1K records takes only 2 minutes now (30 seconds to get the path
> list and 0.5 seconds per thread to read a record).
> Such performance is exactly what I expected: "only the requested BINARY
> are scanned".
> Moreover, HAR provides direct access to each record via the hdfs shell
> command.
>
> Thank you very much!
> ------------------------------
> *From:* Jörn Franke <jornfra...@gmail.com>
> *Sent:* April 17, 2017 22:37:48
> *To:* 莫涛
> *Cc:* user@spark.apache.org
> *Subject:* Re: Re: How to store 10M records in HDFS to speed up further
> filtering?
>
> Yes, 5 MB is a difficult size: too small for HDFS, too big for parquet/orc.
> Maybe you can put the data in a HAR and store (id, path) in orc/parquet.
>
> On 17. Apr 2017, at 10:52, 莫涛 <mo...@sensetime.com> wrote:
>
> Hi Jörn,
>
> I do think a 5 MB column is odd, but I didn't have any other idea before
> asking this question. The binary data is a short video and the maximum
> size is no more than 50 MB.
>
> Hadoop archive sounds very interesting and I'll try it first to check
> whether filtering is fast on it.
>
> To my best knowledge, HBase works best for records around hundreds of KB,
> and it requires extra work from the cluster administrator. So this would
> be the last option.
>
> Thanks!
>
> Mo Tao
> ------------------------------
> *From:* Jörn Franke <jornfra...@gmail.com>
> *Sent:* April 17, 2017 15:59:28
> *To:* 莫涛
> *Cc:* user@spark.apache.org
> *Subject:* Re: How to store 10M records in HDFS to speed up further
> filtering?
>
> You need to sort the data by id, otherwise a situation can occur where the
> index does not work. Aside from this, it sounds odd to put a 5 MB column
> in those formats; it will also not be very efficient.
> What is in the 5 MB binary data?
> You could use HAR or maybe HBase to store this kind of data (if it does
> not get much larger than 5 MB).
>
> > On 17. Apr 2017, at 08:23, MoTao <mo...@sensetime.com> wrote:
> >
> > Hi all,
> >
> > I have 10M (ID, BINARY) records, and the size of each BINARY is 5 MB on
> > average.
> > In my daily application, I need to filter out 10K BINARY according to an
> > ID list.
> > How should I store the whole data to make the filtering faster?
> >
> > I'm using DataFrame in Spark 2.0.0 and I've tried a row-based format
> > (avro) and a column-based format (orc).
> > However, both of them require scanning almost ALL records, making the
> > filtering stage very, very slow.
> > The code block for filtering looks like:
> >
> > val IDSet: Set[String] = ...
> > val checkID = udf { ID: String => IDSet(ID) }
> > spark.read.orc("/path/to/whole/data")
> >   .filter(checkID($"ID"))
> >   .select($"ID", $"BINARY")
> >   .write...
> >
> > Thanks for any advice!
> >
> >
> > --
> > View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-store-10M-records-in-HDFS-to-speed-up-further-filtering-tp28605.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
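
A minimal sketch of the approach discussed above (binaries packed into a HAR, an (ID, path) mapping kept in Parquet, and only the requested records read back through the har:// filesystem). All paths, column names, and the archive layout below are placeholders, not the poster's actual setup; the archive itself would be built beforehand with something like `hadoop archive -archiveName videos.har -p /data/videos /archives`.

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object HarLookupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("har-lookup-sketch").getOrCreate()
    import spark.implicits._

    // (ID, path) mapping stored in Parquet, as suggested in the thread.
    // Location and column names ("ID", "path") are assumptions.
    val mapping = spark.read.parquet("/path/to/id_path_mapping")

    // IDs to look up; in practice this would be the 10K-entry ID list.
    val wantedIds = Seq("id-000001", "id-000002").toDF("ID")

    // Broadcast join so only the small mapping table is scanned,
    // never the 5 MB binaries themselves.
    val paths = mapping
      .join(broadcast(wantedIds), "ID")
      .select($"ID", $"path")
      .collect()

    // Read each matching record directly out of the archive via the
    // har:// filesystem, e.g. har:///archives/videos.har/<id>.bin (assumed layout).
    // This runs on the driver because `paths` was collected above.
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    paths.foreach { row =>
      val p = new Path(row.getString(1))
      val fs = p.getFileSystem(hadoopConf)
      val in = fs.open(p)
      try {
        // consume the binary record here (copy to local disk, decode, etc.)
      } finally {
        in.close()
      }
    }

    spark.stop()
  }
}

The broadcast join replaces the UDF-based filter from the original post: instead of scanning every (ID, BINARY) row, only the lightweight mapping is filtered, and the large binaries are touched one record at a time by path.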