[jira] [Created] (HADOOP-11829) Improve the vector size of Bloom Filter from int to long, and storage from memory to disk

2015-04-13 Thread Hongbo Xu (JIRA)
Hongbo Xu created HADOOP-11829:
--

 Summary: Improve the vector size of Bloom Filter from int to long, 
and storage from memory to disk
 Key: HADOOP-11829
 URL: https://issues.apache.org/jira/browse/HADOOP-11829
 Project: Hadoop Common
  Issue Type: Improvement
  Components: util
Reporter: Hongbo Xu
Assignee: Hongbo Xu
Priority: Minor


org.apache.hadoop.util.bloom.BloomFilter(int vectorSize, int nbHash, int 
hashType) 
With a false positive probability of 0.0001, this filter can hold almost 
900 million objects, and it needs 2.1 GB of RAM.
In my project I need a filter with a capacity of 2 billion objects. It would 
need 4.7 GB of RAM and a vector size of 38340233509, which is outside the 
range of int, and I do not have that much RAM. So I built a big Bloom filter 
whose vector size is a long and whose bit data is split into several files on 
disk. The files are then distributed to the worker nodes, and the performance 
is very good.
I think I can contribute this code to Hadoop Common, along with a 128-bit hash 
function (MurmurHash).
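
The vector size quoted above can be cross-checked with the standard Bloom
filter sizing formula, m = -n * ln(p) / (ln 2)^2. The sketch below is only an
illustration of that arithmetic; the class and method names are hypothetical
and are not part of Hadoop or of the reporter's code.

```java
class BloomSizing {
    /** Optimal bit count m = -n * ln(p) / (ln 2)^2 for n items at false positive rate p. */
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** Optimal number of hash functions k = (m / n) * ln 2, rounded to the nearest integer. */
    static int optimalHashes(long bits, long n) {
        return (int) Math.max(1, Math.round((double) bits / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 2_000_000_000L;   // capacity from the issue description
        double p = 0.0001;         // false positive probability from the issue description
        long bits = optimalBits(n, p);
        System.out.println("bits        = " + bits);
        System.out.println("gigabytes   = " + bits / 8 / 1e9);
        System.out.println("fits in int = " + (bits <= Integer.MAX_VALUE));
        System.out.println("hashes      = " + optimalHashes(bits, n));
    }
}
```

For 2 billion objects at 0.0001 this gives roughly 38.3 billion bits, a little
under 4.8 GB, which is far beyond Integer.MAX_VALUE (2147483647) and in line
with the figures in the description.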



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HADOOP-11829) Improve the vector size of Bloom Filter from int to long, and storage from memory to disk

2015-04-21 Thread Hongbo Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-11829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hongbo Xu resolved HADOOP-11829.

Resolution: Invalid

> Improve the vector size of Bloom Filter from int to long, and storage from 
> memory to disk
> -
>
> Key: HADOOP-11829
> URL: https://issues.apache.org/jira/browse/HADOOP-11829
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: util
>Reporter: Hongbo Xu
>Assignee: Hongbo Xu
>Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h





[jira] [Commented] (HADOOP-11829) Improve the vector size of Bloom Filter from int to long, and storage from memory to disk

2017-05-25 Thread Hongbo Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024323#comment-16024323
 ] 

Hongbo Xu commented on HADOOP-11829:


I'm sorry, I cannot put the implementation code online.
But it is very easy: just rebuild a big Bloom filter whose vector size type is 
long, and split the bit data into several files on disk.
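
Since the original code is not available, the following is only a minimal
sketch of what a bit vector addressed by a long index could look like. It keeps
the bits in memory, split into fixed-size chunks so that no single array has to
exceed the int range. The class name LongBitVector and the chunk size are
assumptions made for illustration, not the reporter's implementation.

```java
class LongBitVector {
    // Bits per chunk; 2^20 is an arbitrary choice for this sketch.
    private static final int CHUNK_BITS = 1 << 20;

    private final long size;
    private final long[][] chunks;

    LongBitVector(long size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        this.size = size;
        // Number of chunks, rounded up; each chunk is allocated on first write.
        this.chunks = new long[(int) ((size + CHUNK_BITS - 1) / CHUNK_BITS)][];
    }

    /** Sets the bit at the given long index. */
    void set(long index) {
        checkIndex(index);
        int c = (int) (index / CHUNK_BITS);
        int bit = (int) (index % CHUNK_BITS);
        if (chunks[c] == null) {
            chunks[c] = new long[CHUNK_BITS / 64];
        }
        chunks[c][bit >>> 6] |= 1L << (bit & 63);
    }

    /** Returns true if the bit at the given long index is set. */
    boolean get(long index) {
        checkIndex(index);
        long[] chunk = chunks[(int) (index / CHUNK_BITS)];
        if (chunk == null) {
            return false; // chunk never written, so all of its bits are 0
        }
        int bit = (int) (index % CHUNK_BITS);
        return (chunk[bit >>> 6] & (1L << (bit & 63))) != 0;
    }

    long size() {
        return size;
    }

    private void checkIndex(long index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("bit index " + index);
        }
    }
}
```

A vector created with new LongBitVector(38340233509L) accepts indexes well
beyond the int range. The same index arithmetic (chunk number, then position
inside the chunk) carries over when each chunk is stored as a file on disk
instead of an array.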




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-11829) Improve the vector size of Bloom Filter from int to long, and storage from memory to disk

2017-05-25 Thread Hongbo Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024357#comment-16024357
 ] 

Hongbo Xu commented on HADOOP-11829:


Each query needs all the data. The bit data is sequential. If you store all of 
it in one file, then to query a new entry you must open that very large file 
and seek to the position. If you split it into several small files with 
numbered file names, you can find your data quickly.
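
The numbered-file layout described above might be sketched as follows. This is
a hedged illustration only: the class name ShardedBitFiles, the ".bits" file
extension, and the bytesPerFile parameter are all hypothetical, and the real
implementation may differ. The file number is the bit index divided by the
number of bits per file, and the remainder gives the byte and bit inside that
file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class ShardedBitFiles {
    private final Path dir;
    private final int bytesPerFile;
    private final long bitsPerFile;

    ShardedBitFiles(Path dir, int bytesPerFile) {
        if (bytesPerFile <= 0) {
            throw new IllegalArgumentException("bytesPerFile must be positive");
        }
        this.dir = dir;
        this.bytesPerFile = bytesPerFile;
        this.bitsPerFile = 8L * bytesPerFile;
    }

    /** File that holds the given bit; its name is simply the file number. */
    Path fileFor(long bitIndex) {
        if (bitIndex < 0) {
            throw new IndexOutOfBoundsException("bit index " + bitIndex);
        }
        return dir.resolve((bitIndex / bitsPerFile) + ".bits");
    }

    /** Sets one bit, creating and zero-filling its file on first use. */
    void set(long bitIndex) {
        Path file = fileFor(bitIndex);
        long inFile = bitIndex % bitsPerFile;
        try (RandomAccessFile f = new RandomAccessFile(file.toFile(), "rw")) {
            if (f.length() < bytesPerFile) {
                // Pad with explicit zeros so unread bytes are well defined.
                long have = f.length();
                f.seek(have);
                f.write(new byte[(int) (bytesPerFile - have)]);
            }
            long pos = inFile / 8;
            f.seek(pos);
            int current = f.read();
            f.seek(pos);
            f.write(current | (1 << (int) (inFile % 8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Tests one bit; a file that was never written counts as all zeros. */
    boolean get(long bitIndex) {
        Path file = fileFor(bitIndex);
        if (!Files.exists(file)) {
            return false;
        }
        long inFile = bitIndex % bitsPerFile;
        try (RandomAccessFile f = new RandomAccessFile(file.toFile(), "r")) {
            f.seek(inFile / 8);
            int b = f.read();
            return b >= 0 && (b & (1 << (int) (inFile % 8))) != 0;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With small files, each lookup opens only the file whose number matches the bit
index, and the files can be copied to worker nodes independently. This sketch
opens the file on every call for clarity; a real implementation would probably
cache open files or memory-map them.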







[jira] [Commented] (HADOOP-11829) Improve the vector size of Bloom Filter from int to long, and storage from memory to disk

2017-05-25 Thread Hongbo Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025647#comment-16025647
 ] 

Hongbo Xu commented on HADOOP-11829:


YES



