Forgive my ignorance, but what does HAR mean? Is it an acronym for High
Availability Record?

Thanks

Alonso Isidoro Roman
about.me/alonso.isidoro.roman

2017-04-20 10:58 GMT+02:00 莫涛 <mo...@sensetime.com>:

> Hi Jörn,
>
>
> HAR is a great idea!
>
>
> For a POC, I've archived 1M records and stored the id -> path mapping in
> plain text (for better readability).
>
> Filtering 1K records now takes only 2 minutes (30 seconds to get the path
> list and 0.5 seconds per thread to read a record).
>
> Such performance is exactly what I expected: "only the requested BINARY
> records are scanned".
>
> Moreover, HAR provides direct access to each record via HDFS shell
> commands.
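>
> For reference, a minimal sketch of that lookup flow in Scala, assuming a
> hypothetical plain-text mapping file with one "id<TAB>har-path" pair per
> line; the file names, paths, and thread-pool size are illustrative, not the
> actual POC code:
>
> import java.io.FileOutputStream
> import java.util.concurrent.Executors
> import scala.concurrent.{Await, ExecutionContext, Future}
> import scala.concurrent.duration.Duration
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.io.IOUtils
>
> val wanted: Set[String] = Set("id_0001", "id_0002")   // requested IDs (illustrative)
>
> // 1) resolve the requested IDs to har:// paths from a local copy of the mapping
> val idToPath = scala.io.Source.fromFile("/local/id_to_path.txt")
>   .getLines()
>   .map(_.split("\t"))
>   .collect { case Array(id, path) if wanted(id) => id -> path }
>   .toMap
>
> // 2) read each record directly from the archive, one task per record
> val pool = Executors.newFixedThreadPool(16)
> implicit val ec = ExecutionContext.fromExecutor(pool)
> val conf = new Configuration()
> val reads = idToPath.toSeq.map { case (id, p) =>
>   Future {
>     val path = new Path(p)               // e.g. har:///archives/videos.har/id_0001.bin
>     val in   = path.getFileSystem(conf).open(path)
>     val out  = new FileOutputStream(s"/tmp/$id.bin")
>     try IOUtils.copyBytes(in, out, conf, false) finally { in.close(); out.close() }
>   }
> }
> Await.result(Future.sequence(reads), Duration.Inf)
> pool.shutdown()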
>
>
> Thank you very much!
> ------------------------------
> *From:* Jörn Franke <jornfra...@gmail.com>
> *Sent:* April 17, 2017, 22:37:48
> *To:* 莫涛
> *Cc:* user@spark.apache.org
> *Subject:* Re: Re: How to store 10M records in HDFS to speed up further
> filtering?
>
> Yes, 5 MB is a difficult size: too small for HDFS, yet too big for Parquet/ORC.
> Maybe you can put the data in a HAR and store the (id, path) mapping in ORC/Parquet.
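>
> A rough sketch of that layout, purely illustrative (the archive name, paths,
> and the <id>.bin file naming are assumptions): the archive would be built
> once with the `hadoop archive` shell command, and the (id, path) mapping
> kept as a small Parquet/ORC table that Spark can filter cheaply:
>
> // Archive built beforehand, e.g.:
> //   hadoop archive -archiveName videos.har -p /raw/videos /archives
> // Each record is assumed to be a file named <id>.bin inside the archive.
>
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder().appName("har-index").getOrCreate()
> import spark.implicits._
>
> val ids: Seq[String] = Seq("id_0001", "id_0002")   // illustrative
> ids.map(id => (id, s"har:///archives/videos.har/$id.bin"))
>   .toDF("ID", "PATH")
>   .write.parquet("/index/id_to_path")
>
> // Later, resolve only the requested IDs to their paths:
> val wanted = Set("id_0001")
> val paths = spark.read.parquet("/index/id_to_path")
>   .filter($"ID".isin(wanted.toSeq: _*))
>   .select($"PATH").as[String]
>   .collect()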
>
> On 17. Apr 2017, at 10:52, 莫涛 <mo...@sensetime.com> wrote:
>
> Hi Jörn,
>
>
> I do think a 5 MB column is odd, but I didn't have any better idea before
> asking this question. The binary data is a short video, and the maximum
> size is no more than 50 MB.
>
>
> Hadoop Archive sounds very interesting, and I'll try it first to check
> whether filtering on it is fast.
>
>
> To the best of my knowledge, HBase works best for records of around hundreds
> of KB, and it requires extra work from the cluster administrator, so this
> would be the last option.
>
>
> Thanks!
>
>
> Mo Tao
> ------------------------------
> *From:* Jörn Franke <jornfra...@gmail.com>
> *Sent:* April 17, 2017, 15:59:28
> *To:* 莫涛
> *Cc:* user@spark.apache.org
> *Subject:* Re: How to store 10M records in HDFS to speed up further filtering?
>
> You need to sort the data by ID, otherwise a situation can occur where the
> index does not work. Aside from this, it sounds odd to store a 5 MB column
> in those formats; it will also not be very efficient.
> What is in the 5 MB binary data?
> You could use HAR or maybe HBase to store this kind of data (if it does
> not get much larger than 5 MB).
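>
> For the ORC route, here is a minimal sketch of what sorting by ID buys you,
> assuming an existing SparkSession named `spark` and the column names from
> the snippet further down; whether stripes are actually skipped also depends
> on spark.sql.orc.filterPushdown being enabled:
>
> import spark.implicits._
>
> // Rewrite the data sorted by ID so each ORC stripe covers a narrow ID range.
> spark.read.orc("/path/to/whole/data")
>   .sort("ID")
>   .write.orc("/path/to/sorted/data")
>
> // Filter with a literal predicate instead of a UDF: a UDF is opaque to the
> // reader, while an IN list can be pushed down so stripes are skipped via
> // their min/max statistics.
> val wanted: Set[String] = Set("id_0001", "id_0002")   // illustrative
> spark.read.orc("/path/to/sorted/data")
>   .filter($"ID".isin(wanted.toSeq: _*))
>   .select($"ID", $"BINARY")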
>
> > On 17. Apr 2017, at 08:23, MoTao <mo...@sensetime.com> wrote:
> >
> > Hi all,
> >
> > I have 10M (ID, BINARY) records, and the size of each BINARY is 5 MB on
> > average.
> > In my daily application, I need to filter out 10K BINARY values according
> > to an ID list.
> > How should I store the whole dataset to make the filtering faster?
> >
> > I'm using DataFrames in Spark 2.0.0, and I've tried a row-based format
> > (Avro) and a column-based format (ORC).
> > However, both of them require scanning almost ALL records, which makes the
> > filtering stage very, very slow.
> > The code block for filtering looks like:
> >
> > import org.apache.spark.sql.functions.udf
> > import spark.implicits._
> >
> > val IDSet: Set[String] = ...
> > val checkID = udf { ID: String => IDSet(ID) }
> > spark.read.orc("/path/to/whole/data")
> >   .filter(checkID($"ID"))
> >   .select($"ID", $"BINARY")
> >   .write...
> >
> > Thanks for any advice!
> >
> >
> >
> >
> > --
> > View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-store-10M-records-in-HDFS-to-speed-up-further-filtering-tp28605.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
>
