Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-10-05 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/
---

(Updated Oct. 5, 2016, 6:16 p.m.)


Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.


Changes
---

fix gradle warning


Bugs: SAMZA-967
https://issues.apache.org/jira/browse/SAMZA-967


Repository: samza


Description
---

Add HDFS System Consumer: 

1. System admin, partitioner
2. System consumer with metrics

Design doc can be found here: 
https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf

An overview of the high level architecture: 

The system factory is used by Samza to instantiate SystemConsumer, 
SystemProducer, and SystemAdmin for a specific system. The 
FileDataSystemFactory can be reused for other file system like sources. 

HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
HDFS files need to be consumed for this job. The DirectoryPartitioner also uses 
“GroupingPattern” to group files into partitions if advanced partitioning is 
required. HDFSSystemAdmin will then persist the “PartitionDescriptor” to HDFS.

The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
Based on this information as well as the actual assignment of partitions, it 
would then know which files to read from.

The initial implementation of the HDFS system consumer supports only avro data 
files. It’s very easy to extend it to a variety of file format by implementing 
the FileReader interface.


  
 
+--+
 
 |  
| 
   +-+ HDFS 
| 
   |   Obtain|  
| 
   |  Partition  
+--+--^--+-^---+
 
   | Description|  |  | 
| 
   ||  |  | 
| 
   |  +-v---+  |  |   
Filtering/| 
   |  | |  |  +---+
Grouping +-+   
   |  | HDFSAvroFileReader  |  |  | 
  |   
   |  | |Persist   |  | 
  |   
   |  +-+---+   Partition  |  | 
  |   
   ||  Description |   
+--v--+ +--+--+
   ||  |   |
 | | |
   |  +-+---+  |   |Directory 
Partitioner| |   HDFSAvroWriter|
   |  | IFileReader |  |   |
 | | |
   |  | |  |   
+--+--+ +--+--+
   |  +-+---+  |  | 
  |   
   ||  |  | 
  |   
   ||  |  | 
  |   
   |  +-+---+
+-+--++   +--+--+
   |  | || 
|   | |
   |  | HDFSSystemConsumer  ||   HDFSSystemAdmin   
|   | HDFSSystemProducer  |
   +--> || 
|   | |
  +-+---+
+---+-+   +--+--+
||  
  |   

+++ 
  
  

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-10-04 Thread Hai Lu


> On Sept. 29, 2016, 10:02 p.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala, 
> > line 197
> > 
> >
> > Thinking of this more, I would prefer less dependency imposed between 
> > samza-yarn and samza-hdfs modules. Thinking of a case where HDFS consumer 
> > is used by a standalone Samza job, there is no YarnConfig object in the 
> > job. I think we should make this as required config for HdfsSystemConsumer, 
> > just like ZooKeeper connnect string is required for KafkaSystemConsumer.
> > 
> > Also, under which condition we need to clear the partition descriptor 
> > info in the staging dir? We need to think about the cleanup procedure as 
> > well.
> 
> Hai Lu wrote:
> We need to remove partition descriptors when job is done. Not doing so 
> would end up spamming user's HDFS space, causing immediate troubles to our 
> users. 
> 
> But right now there is no way that HdfsSystemConsumer/Admin would know 
> when the job is shutdown. So I don't see there is a solution if we don't 
> directly/indirectly depend on YARN, since only the YARN codes have this idea 
> of staging directory, and actually clean up the directory at the end of the 
> job.  I think what we really need to do, long term, is to support staging 
> direcotry in the Samza level, so that in addition to YARN, other platforms 
> like Docker, Mesos, Standalone can work as well.
> 
> Plus we have to keep in mind that only YARN has the kerberos support for 
> now. So currently HDFS systems ARE depending on YARN in that sense. Security 
> is one more thing to deal with (aside from staging directory) before we can 
> say HDFS sytems no long depends on YARN.
> 
> What do you think? I will keep this issue open.
> 
> Yi Pan (Data Infrastructure) wrote:
> There are two different levels of dependencies here: a) code-level 
> dependency that means the HdfsSystemConsumer code depends directly on 
> samza-yarn classes; b) config/semantic dependency that means some expected 
> behavior of a certain function (i.e. cleanup) depends on other modules. I 
> would prefer to remove the code-level dependency from the beginning. We can 
> still set the configuration of HdfsSystemConsumer to use the same staging 
> directory configuration from samza-yarn to achieve the cleanup function. This 
> means that HdfsSystemConsumer itself does not support after-completion 
> cleanup yet and depends on samza-yarn to clean up. It is a configure-level 
> dependency and we have the freedom to remove this w/o code change when either 
> a) HdfsSystemConsumer can cleanup the staging directory after end-of-stream; 
> b) staging directory config is moved to samza-core. Thoughts?

Discussed offline. Given that HDFS system consumer right now has to run on 
YARN, it's OK to explictly have the code dependency for now. Opened SAMZA-1032 
to track the work to move staging directory out of the YARN context.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150949
---


On Oct. 5, 2016, 12:02 a.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Oct. 5, 2016, 12:02 a.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-10-04 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/
---

(Updated Oct. 5, 2016, 12:02 a.m.)


Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.


Bugs: SAMZA-967
https://issues.apache.org/jira/browse/SAMZA-967


Repository: samza


Description
---

Add HDFS System Consumer: 

1. System admin, partitioner
2. System consumer with metrics

Design doc can be found here: 
https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf

An overview of the high level architecture: 

The system factory is used by Samza to instantiate SystemConsumer, 
SystemProducer, and SystemAdmin for a specific system. The 
FileDataSystemFactory can be reused for other file system like sources. 

HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
HDFS files need to be consumed for this job. The DirectoryPartitioner also uses 
“GroupingPattern” to group files into partitions if advanced partitioning is 
required. HDFSSystemAdmin will then persist the “PartitionDescriptor” to HDFS.

The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
Based on this information as well as the actual assignment of partitions, it 
would then know which files to read from.

The initial implementation of the HDFS system consumer supports only avro data 
files. It’s very easy to extend it to a variety of file format by implementing 
the FileReader interface.


  
 
+--+
 
 |  
| 
   +-+ HDFS 
| 
   |   Obtain|  
| 
   |  Partition  
+--+--^--+-^---+
 
   | Description|  |  | 
| 
   ||  |  | 
| 
   |  +-v---+  |  |   
Filtering/| 
   |  | |  |  +---+
Grouping +-+   
   |  | HDFSAvroFileReader  |  |  | 
  |   
   |  | |Persist   |  | 
  |   
   |  +-+---+   Partition  |  | 
  |   
   ||  Description |   
+--v--+ +--+--+
   ||  |   |
 | | |
   |  +-+---+  |   |Directory 
Partitioner| |   HDFSAvroWriter|
   |  | IFileReader |  |   |
 | | |
   |  | |  |   
+--+--+ +--+--+
   |  +-+---+  |  | 
  |   
   ||  |  | 
  |   
   ||  |  | 
  |   
   |  +-+---+
+-+--++   +--+--+
   |  | || 
|   | |
   |  | HDFSSystemConsumer  ||   HDFSSystemAdmin   
|   | HDFSSystemProducer  |
   +--> || 
|   | |
  +-+---+
+---+-+   +--+--+
||  
  |   

+++ 
  
   

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-10-03 Thread Yi Pan (Data Infrastructure)


> On Sept. 29, 2016, 10:02 p.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala, 
> > line 197
> > 
> >
> > Thinking of this more, I would prefer less dependency imposed between 
> > samza-yarn and samza-hdfs modules. Thinking of a case where HDFS consumer 
> > is used by a standalone Samza job, there is no YarnConfig object in the 
> > job. I think we should make this as required config for HdfsSystemConsumer, 
> > just like ZooKeeper connnect string is required for KafkaSystemConsumer.
> > 
> > Also, under which condition we need to clear the partition descriptor 
> > info in the staging dir? We need to think about the cleanup procedure as 
> > well.
> 
> Hai Lu wrote:
> We need to remove partition descriptors when job is done. Not doing so 
> would end up spamming user's HDFS space, causing immediate troubles to our 
> users. 
> 
> But right now there is no way that HdfsSystemConsumer/Admin would know 
> when the job is shutdown. So I don't see there is a solution if we don't 
> directly/indirectly depend on YARN, since only the YARN codes have this idea 
> of staging directory, and actually clean up the directory at the end of the 
> job.  I think what we really need to do, long term, is to support staging 
> direcotry in the Samza level, so that in addition to YARN, other platforms 
> like Docker, Mesos, Standalone can work as well.
> 
> Plus we have to keep in mind that only YARN has the kerberos support for 
> now. So currently HDFS systems ARE depending on YARN in that sense. Security 
> is one more thing to deal with (aside from staging directory) before we can 
> say HDFS sytems no long depends on YARN.
> 
> What do you think? I will keep this issue open.

There are two different levels of dependencies here: a) code-level dependency 
that means the HdfsSystemConsumer code depends directly on samza-yarn classes; 
b) config/semantic dependency that means some expected behavior of a certain 
function (i.e. cleanup) depends on other modules. I would prefer to remove the 
code-level dependency from the beginning. We can still set the configuration of 
HdfsSystemConsumer to use the same staging directory configuration from 
samza-yarn to achieve the cleanup function. This means that HdfsSystemConsumer 
itself does not support after-completion cleanup yet and depends on samza-yarn 
to clean up. It is a configure-level dependency and we have the freedom to 
remove this w/o code change when either a) HdfsSystemConsumer can cleanup the 
staging directory after end-of-stream; b) staging directory config is moved to 
samza-core. Thoughts?


- Yi


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150949
---


On Oct. 3, 2016, 4:54 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Oct. 3, 2016, 4:54 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-10-03 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/
---

(Updated Oct. 3, 2016, 4:54 p.m.)


Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.


Changes
---

End of stream support merge


Bugs: SAMZA-967
https://issues.apache.org/jira/browse/SAMZA-967


Repository: samza


Description
---

Add HDFS System Consumer: 

1. System admin, partitioner
2. System consumer with metrics

Design doc can be found here: 
https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf

An overview of the high level architecture: 

The system factory is used by Samza to instantiate SystemConsumer, 
SystemProducer, and SystemAdmin for a specific system. The 
FileDataSystemFactory can be reused for other file system like sources. 

HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
HDFS files need to be consumed for this job. The DirectoryPartitioner also uses 
“GroupingPattern” to group files into partitions if advanced partitioning is 
required. HDFSSystemAdmin will then persist the “PartitionDescriptor” to HDFS.

The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
Based on this information as well as the actual assignment of partitions, it 
would then know which files to read from.

The initial implementation of the HDFS system consumer supports only avro data 
files. It’s very easy to extend it to a variety of file format by implementing 
the FileReader interface.


  
 
+--+
 
 |  
| 
   +-+ HDFS 
| 
   |   Obtain|  
| 
   |  Partition  
+--+--^--+-^---+
 
   | Description|  |  | 
| 
   ||  |  | 
| 
   |  +-v---+  |  |   
Filtering/| 
   |  | |  |  +---+
Grouping +-+   
   |  | HDFSAvroFileReader  |  |  | 
  |   
   |  | |Persist   |  | 
  |   
   |  +-+---+   Partition  |  | 
  |   
   ||  Description |   
+--v--+ +--+--+
   ||  |   |
 | | |
   |  +-+---+  |   |Directory 
Partitioner| |   HDFSAvroWriter|
   |  | IFileReader |  |   |
 | | |
   |  | |  |   
+--+--+ +--+--+
   |  +-+---+  |  | 
  |   
   ||  |  | 
  |   
   ||  |  | 
  |   
   |  +-+---+
+-+--++   +--+--+
   |  | || 
|   | |
   |  | HDFSSystemConsumer  ||   HDFSSystemAdmin   
|   | HDFSSystemProducer  |
   +--> || 
|   | |
  +-+---+
+---+-+   +--+--+
||  
  |   

+++ 

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-10-03 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/
---

(Updated Oct. 3, 2016, 4:04 p.m.)


Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.


Bugs: SAMZA-967
https://issues.apache.org/jira/browse/SAMZA-967


Repository: samza


Description
---

Add HDFS System Consumer: 

1. System admin, partitioner
2. System consumer with metrics

Design doc can be found here: 
https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf

An overview of the high level architecture: 

The system factory is used by Samza to instantiate SystemConsumer, 
SystemProducer, and SystemAdmin for a specific system. The 
FileDataSystemFactory can be reused for other file system like sources. 

HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
HDFS files need to be consumed for this job. The DirectoryPartitioner also uses 
“GroupingPattern” to group files into partitions if advanced partitioning is 
required. HDFSSystemAdmin will then persist the “PartitionDescriptor” to HDFS.

The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
Based on this information as well as the actual assignment of partitions, it 
would then know which files to read from.

The initial implementation of the HDFS system consumer supports only avro data 
files. It’s very easy to extend it to a variety of file format by implementing 
the FileReader interface.


  
 
+--+
 
 |  
| 
   +-+ HDFS 
| 
   |   Obtain|  
| 
   |  Partition  
+--+--^--+-^---+
 
   | Description|  |  | 
| 
   ||  |  | 
| 
   |  +-v---+  |  |   
Filtering/| 
   |  | |  |  +---+
Grouping +-+   
   |  | HDFSAvroFileReader  |  |  | 
  |   
   |  | |Persist   |  | 
  |   
   |  +-+---+   Partition  |  | 
  |   
   ||  Description |   
+--v--+ +--+--+
   ||  |   |
 | | |
   |  +-+---+  |   |Directory 
Partitioner| |   HDFSAvroWriter|
   |  | IFileReader |  |   |
 | | |
   |  | |  |   
+--+--+ +--+--+
   |  +-+---+  |  | 
  |   
   ||  |  | 
  |   
   ||  |  | 
  |   
   |  +-+---+
+-+--++   +--+--+
   |  | || 
|   | |
   |  | HDFSSystemConsumer  ||   HDFSSystemAdmin   
|   | HDFSSystemProducer  |
   +--> || 
|   | |
  +-+---+
+---+-+   +--+--+
||  
  |   

+++ 
  

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-10-02 Thread Hai Lu


> On Sept. 29, 2016, 10:06 p.m., Yi Pan (Data Infrastructure) wrote:
> > samza-shell/src/main/bash/run-job-for-azkaban.sh, line 1
> > 
> >
> > Question: why do we need this in open source? Don't we already have a 
> > run-job.sh in open source that is general for any YARN application?

Removed from open source.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150953
---


On Sept. 28, 2016, 9:57 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 28, 2016, 9:57 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>| | |
>|  +-+---+  |   |Directory 
> Partitioner| |   HDFSAvroWriter|
>|  | IFileReader |  |   |  
>| | |
>|  | |  |   
> +--+--+ +--+--+
>|  +-+---+  |  |   
> |   
>||  |  |   
> |   
>||  |  |   
> |   
>   

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-10-02 Thread Hai Lu


> On Sept. 29, 2016, 10:02 p.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestMultiFileHdfsReader.java,
> >  line 17
> > 
> >
> > nit: I would recommend to test negative case where the offset is 
> > out-of-range as well.

Done


> On Sept. 29, 2016, 10:02 p.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java,
> >  line 98
> > 
> >
> > nit: need to validate that curFileIndex and curSingleFileOffset are 
> > within the range. Capture the out-of-range exceptions will help debugging.

Fixed and added tests


> On Sept. 29, 2016, 10:02 p.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java,
> >  line 119
> > 
> >
> > nit: A while loop is enough and cheaper than building up the recursive 
> > call stacks.

Done.


> On Sept. 29, 2016, 10:02 p.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala, 
> > line 197
> > 
> >
> > Thinking of this more, I would prefer less dependency imposed between 
> > samza-yarn and samza-hdfs modules. Thinking of a case where HDFS consumer 
> > is used by a standalone Samza job, there is no YarnConfig object in the 
> > job. I think we should make this as required config for HdfsSystemConsumer, 
> > just like ZooKeeper connnect string is required for KafkaSystemConsumer.
> > 
> > Also, under which condition we need to clear the partition descriptor 
> > info in the staging dir? We need to think about the cleanup procedure as 
> > well.

We need to remove partition descriptors when job is done. Not doing so would 
end up spamming user's HDFS space, causing immediate troubles to our users. 

But right now there is no way that HdfsSystemConsumer/Admin would know when the 
job is shutdown. So I don't see there is a solution if we don't 
directly/indirectly depend on YARN, since only the YARN codes have this idea of 
staging directory, and actually clean up the directory at the end of the job.  
I think what we really need to do, long term, is to support staging direcotry 
in the Samza level, so that in addition to YARN, other platforms like Docker, 
Mesos, Standalone can work as well.

Plus we have to keep in mind that only YARN has the kerberos support for now. 
So currently HDFS systems ARE depending on YARN in that sense. Security is one 
more thing to deal with (aside from staging directory) before we can say HDFS 
sytems no long depends on YARN.

What do you think? I will keep this issue open.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150949
---


On Sept. 28, 2016, 9:57 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 28, 2016, 9:57 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-29 Thread Yi Pan (Data Infrastructure)

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150953
---




samza-shell/src/main/bash/run-job-for-azkaban.sh (line 1)


Question: why do we need this in open source? Don't we already have a 
run-job.sh in open source that is general for any YARN application?


- Yi Pan (Data Infrastructure)


On Sept. 28, 2016, 9:57 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 28, 2016, 9:57 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>| | |
>|  +-+---+  |   |Directory 
> Partitioner| |   HDFSAvroWriter|
>|  | IFileReader |  |   |  
>| | |
>|  | |  |   
> +--+--+ +--+--+
>|  +-+---+  |  |   
> |   
>||  |  |   
> |   
>||  |  |   
> |   
>|  +-+---+
> +-+--++   +--+--+

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-29 Thread Yi Pan (Data Infrastructure)

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150949
---


Fix it, then Ship it!




Overall looks pretty good to me. Just a few nits and questions on inter-module 
dependencies. Thanks!


samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 (line 95)


nit: need to validate that curFileIndex and curSingleFileOffset are within 
the range. Capture the out-of-range exceptions will help debugging.



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 (line 114)


nit: A while loop is enough and cheaper than building up the recursive call 
stacks.



samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala (line 
197)


Thinking of this more, I would prefer less dependency imposed between 
samza-yarn and samza-hdfs modules. Thinking of a case where HDFS consumer is 
used by a standalone Samza job, there is no YarnConfig object in the job. I 
think we should make this as required config for HdfsSystemConsumer, just like 
ZooKeeper connnect string is required for KafkaSystemConsumer.

Also, under which condition we need to clear the partition descriptor info 
in the staging dir? We need to think about the cleanup procedure as well.



samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 (line 99)


nit: typo. descriporMap --> descriptorMap



samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 (line 251)


nit: typo here as well: *updatedDescriptorMap*



samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestMultiFileHdfsReader.java
 (line 17)


nit: I would recommend to test negative case where the offset is 
out-of-range as well.


- Yi Pan (Data Infrastructure)


On Sept. 28, 2016, 9:57 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 28, 2016, 9:57 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
> 

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-29 Thread Yi Pan (Data Infrastructure)


> On Sept. 14, 2016, 6:19 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java,
> >  line 59
> > 
> >
> > Not sure what are we doing here? What's the ordering that we are 
> > enforcing in this multi-file partition? I saw that you are trying to make 
> > the offsets as an offset vector on top of all files in the same partition. 
> > Why? Can we simplify it by making it full-ordered in the same partition 
> > instead of partial-ordered via an offset vector?
> 
> Navina Ramesh wrote:
> I couldn't figure out the resolution for this issue. Perhaps you 
> discussed offline. Can you please update the discussion here for everyone's 
> benefit?
> 
> Yi Pan (Data Infrastructure) wrote:
> Yes, we discussed offline. I thought the conclusion is to drop the 
> multi-file per partition support now. @Hai, could you confirm?

Looks like the latest patch has already implemented a full-order multi-file 
partition. Hence, it is addressed.


- Yi


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148780
---


On Sept. 28, 2016, 9:57 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 28, 2016, 9:57 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>| | |
>|  +-+---+ 

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-29 Thread Yi Pan (Data Infrastructure)

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150895
---



Still in the middle (MultiFileHdfsReader). Will continue after lunch. BTW, 
overall looks very good! Thanks!


build.gradle (line 465)


I would strongly recommend to remove the dependency on samza-kafka and 
samza-yarn here. There are only very small portion of code/configuration that 
we needed in samza-hdfs and I would vote for less dependency on other modules 
here.



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
(line 55)


Nice doc! Curious whether you use some tools for the ASCII graphs here?



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
(line 199)


So, we are not handling HdfsFileReader exceptions w/ retries yet, right? I 
think that it is fine for the first implementation. We should definitely log a 
JIRA for improvement here though, since I heard from Venice team that Hadoop 
actually is less reliable than Kafka.


- Yi Pan (Data Infrastructure)


On Sept. 28, 2016, 9:57 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 28, 2016, 9:57 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>| | |
>|  +-+---+   

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-29 Thread Navina Ramesh


> On Sept. 14, 2016, 6:19 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala,
> >  line 38
> > 
> >
> > Not related to your RB, but could you open a JIRA for this one? Using 
> > KafkaUtil class in HdfsSystemFactory seems really weird.
> 
> Yi Pan (Data Infrastructure) wrote:
> Just realized, w/ this dependency, are we creating a dependency on 
> samza-kafka in samza-hdfs? I don't think that is right. samza-kafka and 
> samza-hdfs should remain as two independent modules implementing different 
> SystemFactory and can not depend on each other. We definitely need to have a 
> JIRA addressing this one.

I think Hai created a JIRA for this yesterday. 
https://issues.apache.org/jira/browse/SAMZA-1026?jql=project%20%3D%20SAMZA%20AND%20created%3E%3D-1w%20ORDER%20BY%20created%20DESC


- Navina


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148780
---


On Sept. 28, 2016, 9:57 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 28, 2016, 9:57 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>| | |
>|  +-+---+  |   |Directory 
> Partitioner| |   HDFSAvroWriter|
>|  | IFileReader |  |   |  

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-29 Thread Yi Pan (Data Infrastructure)


> On Sept. 14, 2016, 6:19 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala,
> >  line 38
> > 
> >
> > Not related to your RB, but could you open a JIRA for this one? Using 
> > KafkaUtil class in HdfsSystemFactory seems really weird.

Just realized, w/ this dependency, are we creating a dependency on samza-kafka 
in samza-hdfs? I don't think that is right. samza-kafka and samza-hdfs should 
remain as two independent modules implementing different SystemFactory and can 
not depend on each other. We definitely need to have a JIRA addressing this one.


- Yi


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148780
---


On Sept. 28, 2016, 9:57 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 28, 2016, 9:57 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>| | |
>|  +-+---+  |   |Directory 
> Partitioner| |   HDFSAvroWriter|
>|  | IFileReader |  |   |  
>| | |
>|  | |  |   
> +--+--+ +--+--+
>|  +-+---+  |

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-29 Thread Navina Ramesh


> On Sept. 29, 2016, 5:56 p.m., Prateek Maheshwari wrote:
> > samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala, 
> > line 66
> > 
> >
> > "systems.%s.consumer.buffer-capacity" makes sense to me. Regarding the 
> > "hdfs" prefix, there's already inconsistency in current configs. The kafka 
> > system configs don't include kafka in the config name, but the hdfs 
> > producer configs do. The kafka convention is better IMHO.
> > 
> > Either way, we should at least be consistent between this and the new 
> > partitioner/reader configs which don't have the hdfs prefix.
> 
> Prateek Maheshwari wrote:
> Btw, in the new configs we'll be using camelCase instead of dashes, so 
> we'll eventually need to change it it to bufferCapacity.
> 
> Hai Lu wrote:
> There won't be any problems if I change it to camelCase style now, right? 
> Or should I keep the dash?

You can use camelCase style. Going forward, we should strictly follow using 
period as namespace delimiter. So, it's fine to make this camelcase. Thanks, 
Hai!


- Navina


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150883
---


On Sept. 28, 2016, 9:57 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 28, 2016, 9:57 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>| | |
> 

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-29 Thread Yi Pan (Data Infrastructure)


> On Sept. 13, 2016, 12:33 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java,
> >  line 142
> > 
> >
> > Isn't it clearer to have one loop like below instead of two embedded 
> > loops:
> > while (!isShutdown) {
> >   if (!reader.hasNext()) {
> > break;
> >   }
> >   IncomingMessageEnvelope messageEnvelope = reader.readNext();
> >   try {
> >  super.put()
> >  ...
> >   } catch () {
> >  ...
> >   }
> > }
> 
> Hai Lu wrote:
> No. In your case, if the super.put() fails, your code will skip the 
> current event and read the next one. Unless you throw a runtime exception in 
> the catch block to completely stop the consumption.

Well, the super class is BlockingEnvelopeMap, which the put() will wait 
indefinitely until space is available. I don't see why we are re-trying in this 
case, if super.put() throws an exception. In KafkaSystemConsumer, if 
BlockingEnvelopeMap.put() fails w/ exception, the Kafka consumer (i.e. 
equivalent to the reader here) will be re-created. What do we exactly try to 
achieve here by re-tries?


- Yi


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148612
---


On Sept. 28, 2016, 9:57 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 28, 2016, 9:57 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>|  

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-29 Thread Hai Lu


> On Sept. 29, 2016, 5:56 p.m., Prateek Maheshwari wrote:
> > samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala, 
> > line 66
> > 
> >
> > "systems.%s.consumer.buffer-capacity" makes sense to me. Regarding the 
> > "hdfs" prefix, there's already inconsistency in current configs. The kafka 
> > system configs don't include kafka in the config name, but the hdfs 
> > producer configs do. The kafka convention is better IMHO.
> > 
> > Either way, we should at least be consistent between this and the new 
> > partitioner/reader configs which don't have the hdfs prefix.
> 
> Prateek Maheshwari wrote:
> Btw, in the new configs we'll be using camelCase instead of dashes, so 
> we'll eventually need to change it it to bufferCapacity.

There won't be any problems if I change it to camelCase style now, right? Or 
should I keep the dash?


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150883
---


On Sept. 28, 2016, 9:57 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 28, 2016, 9:57 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>| | |
>|  +-+---+  |   |Directory 
> Partitioner| |   HDFSAvroWriter|
>|  | IFileReader |  |   |  

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-29 Thread Navina Ramesh


> On Sept. 28, 2016, 12:28 a.m., Navina Ramesh wrote:
> > build.gradle, line 308
> > 
> >
> > why is this dependency needed here? It seems like this compile 
> > dependency is required for samza-hdfs and now samza-hdfs depends on 
> > samza-yarn. Is there a better way to do this?
> 
> Hai Lu wrote:
> It comes from the fact that HDFSConfig has a dependency on YarnConfig. 
> It's not a hard dependency as we can copy-paste the config key name. So far 
> we don't have a cleaner solution.
> 
> Yi Pan (Data Infrastructure) wrote:
> I think that I have made the suggestion to define a system-specific 
> configuration instead of "borrowing" from the samza-yarn configuration. It 
> would be much clearer and remove this unnecessary dependency.

Yeah. I meant to reply here. +1 to Yi. They are configs for independent 
components. So, it is ok to duplicate the config name here.


- Navina


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150648
---


On Sept. 28, 2016, 9:57 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 28, 2016, 9:57 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>| | |
>|  +-+---+  |   |Directory 
> Partitioner| |   HDFSAvroWriter|
>|  | IFileReader |

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-29 Thread Prateek Maheshwari


> On Sept. 29, 2016, 10:56 a.m., Prateek Maheshwari wrote:
> > samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala, 
> > line 66
> > 
> >
> > "systems.%s.consumer.buffer-capacity" makes sense to me. Regarding the 
> > "hdfs" prefix, there's already inconsistency in current configs. The kafka 
> > system configs don't include kafka in the config name, but the hdfs 
> > producer configs do. The kafka convention is better IMHO.
> > 
> > Either way, we should at least be consistent between this and the new 
> > partitioner/reader configs which don't have the hdfs prefix.

Btw, in the new configs we'll be using camelCase instead of dashes, so we'll 
eventually need to change it it to bufferCapacity.


- Prateek


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150883
---


On Sept. 28, 2016, 2:57 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 28, 2016, 2:57 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>| | |
>|  +-+---+  |   |Directory 
> Partitioner| |   HDFSAvroWriter|
>|  | IFileReader |  |   |  
>| | |
>|  | |  |   
> +--+

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-29 Thread Prateek Maheshwari

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150883
---




samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala (line 
66)


"systems.%s.consumer.buffer-capacity" makes sense to me. Regarding the 
"hdfs" prefix, there's already inconsistency in current configs. The kafka 
system configs don't include kafka in the config name, but the hdfs producer 
configs do. The kafka convention is better IMHO.

Either way, we should at least be consistent between this and the new 
partitioner/reader configs which don't have the hdfs prefix.


- Prateek Maheshwari


On Sept. 28, 2016, 2:57 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 28, 2016, 2:57 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>| | |
>|  +-+---+  |   |Directory 
> Partitioner| |   HDFSAvroWriter|
>|  | IFileReader |  |   |  
>| | |
>|  | |  |   
> +--+--+ +--+--+
>|  +-+---+  |  |   
> |   
>||  |  |  

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-29 Thread Yi Pan (Data Infrastructure)


> On Sept. 28, 2016, 12:28 a.m., Navina Ramesh wrote:
> > build.gradle, line 308
> > 
> >
> > why is this dependency needed here? It seems like this compile 
> > dependency is required for samza-hdfs and now samza-hdfs depends on 
> > samza-yarn. Is there a better way to do this?
> 
> Hai Lu wrote:
> It comes from the fact that HDFSConfig has a dependency on YarnConfig. 
> It's not a hard dependency as we can copy-paste the config key name. So far 
> we don't have a cleaner solution.

I think that I have made the suggestion to define a system-specific 
configuration instead of "borrowing" from the samza-yarn configuration. It 
would be much clearer and remove this unnecessary dependency.


- Yi


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150648
---


On Sept. 28, 2016, 9:57 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 28, 2016, 9:57 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>| | |
>|  +-+---+  |   |Directory 
> Partitioner| |   HDFSAvroWriter|
>|  | IFileReader |  |   |  
>| | |
>|  | |  |   
> +--+--+ +--+---

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-29 Thread Yi Pan (Data Infrastructure)


> On Sept. 14, 2016, 6:19 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala, 
> > line 66
> > 
> >
> > It would be nicer to make it conforming to Offspring style of config 
> > variable scoping. i.e. if the scope of configuration is for hdfs consumer, 
> > use systems.%s.consumer.hdfs.buffer-capacity. I would suggest to consult 
> > Prateek since he has been working on the Offspring config refactoring. For 
> > new config variables, "." should strictly be used as deliminator between 
> > scopes, not as deliminator between words.
> 
> Navina Ramesh wrote:
> Going by the logic of using period to delimit scopes, shouldn't it be 
> systems.%s.consumer.hdfs-buffer-capacity? Unless there is a hdfs scope that I 
> am not seeing. It is kind of weird because we assume the indirection from 
> systemname (%s) to its factory will act as a scope. I am not sure what the 
> correct pattern should be.

@Navina, yeah, agree. I can not say for myself as an Offspring expert either. 
Can we confirm w/ Prateek on the scoping convention here?


- Yi


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148780
---


On Sept. 28, 2016, 9:57 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 28, 2016, 9:57 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>  

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-29 Thread Yi Pan (Data Infrastructure)


> On Sept. 14, 2016, 6:19 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java,
> >  line 59
> > 
> >
> > Not sure what are we doing here? What's the ordering that we are 
> > enforcing in this multi-file partition? I saw that you are trying to make 
> > the offsets as an offset vector on top of all files in the same partition. 
> > Why? Can we simplify it by making it full-ordered in the same partition 
> > instead of partial-ordered via an offset vector?
> 
> Navina Ramesh wrote:
> I couldn't figure out the resolution for this issue. Perhaps you 
> discussed offline. Can you please update the discussion here for everyone's 
> benefit?

Yes, we discussed offline. I thought the conclusion is to drop the multi-file 
per partition support now. @Hai, could you confirm?


- Yi


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148780
---


On Sept. 28, 2016, 9:57 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 28, 2016, 9:57 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>| | |
>|  +-+---+  |   |Directory 
> Partitioner| |   HDFSAvroWriter|
>|  | IFileReader |  |   |  
>   

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-29 Thread Yi Pan (Data Infrastructure)


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java,
> >  line 58
> > 
> >
> > nit: since the input whiteList/blakcList are also regex, shouldn't we 
> > just name them the same?
> 
> Hai Lu wrote:
> Separating white list and black list simplifies the regex a lot. I see 
> this convention in databus, Kafka 
> (http://kafka.apache.org/documentation.html) and many other systems.

Oh, I was not suggesting merging the whitelist and blacklist parameters here. I 
was referring to rename them to reflect that they are regex as well. Something 
like: whiteListRegex and blackListRegex.


- Yi


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148629
---


On Sept. 28, 2016, 9:57 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 28, 2016, 9:57 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>| | |
>|  +-+---+  |   |Directory 
> Partitioner| |   HDFSAvroWriter|
>|  | IFileReader |  |   |  
>| | |
>|  | |  |   
> +--+--+ +--+--

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-28 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/
---

(Updated Sept. 28, 2016, 7:45 p.m.)


Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
Navina Ramesh.


Bugs: SAMZA-967
https://issues.apache.org/jira/browse/SAMZA-967


Repository: samza


Description
---

Add HDFS System Consumer: 

1. System admin, partitioner
2. System consumer with metrics

Design doc can be found here: 
https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf

An overview of the high level architecture: 

The system factory is used by Samza to instantiate SystemConsumer, 
SystemProducer, and SystemAdmin for a specific system. The 
FileDataSystemFactory can be reused for other file system like sources. 

HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
HDFS files need to be consumed for this job. The DirectoryPartitioner also uses 
“GroupingPattern” to group files into partitions if advanced partitioning is 
required. HDFSSystemAdmin will then persist the “PartitionDescriptor” to HDFS.

The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
Based on this information as well as the actual assignment of partitions, it 
would then know which files to read from.

The initial implementation of the HDFS system consumer supports only avro data 
files. It’s very easy to extend it to a variety of file format by implementing 
the FileReader interface.


  
 
+--+
 
 |  
| 
   +-+ HDFS 
| 
   |   Obtain|  
| 
   |  Partition  
+--+--^--+-^---+
 
   | Description|  |  | 
| 
   ||  |  | 
| 
   |  +-v---+  |  |   
Filtering/| 
   |  | |  |  +---+
Grouping +-+   
   |  | HDFSAvroFileReader  |  |  | 
  |   
   |  | |Persist   |  | 
  |   
   |  +-+---+   Partition  |  | 
  |   
   ||  Description |   
+--v--+ +--+--+
   ||  |   |
 | | |
   |  +-+---+  |   |Directory 
Partitioner| |   HDFSAvroWriter|
   |  | IFileReader |  |   |
 | | |
   |  | |  |   
+--+--+ +--+--+
   |  +-+---+  |  | 
  |   
   ||  |  | 
  |   
   ||  |  | 
  |   
   |  +-+---+
+-+--++   +--+--+
   |  | || 
|   | |
   |  | HDFSSystemConsumer  ||   HDFSSystemAdmin   
|   | HDFSSystemProducer  |
   +--> || 
|   | |
  +-+---+
+---+-+   +--+--+
||  
  |   

+++ 
  
 

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-28 Thread Hai Lu


> On Sept. 28, 2016, 12:28 a.m., Navina Ramesh wrote:
> > build.gradle, line 308
> > 
> >
> > why is this dependency needed here? It seems like this compile 
> > dependency is required for samza-hdfs and now samza-hdfs depends on 
> > samza-yarn. Is there a better way to do this?

It comes from the fact that HDFSConfig has a dependency on YarnConfig. It's not 
a hard dependency as we can copy-paste the config key name. So far we don't 
have a cleaner solution.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150648
---


On Sept. 20, 2016, 11:22 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 20, 2016, 11:22 p.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>| | |
>|  +-+---+  |   |Directory 
> Partitioner| |   HDFSAvroWriter|
>|  | IFileReader |  |   |  
>| | |
>|  | |  |   
> +--+--+ +--+--+
>|  +-+---+  |  |   
> |   
>||  |  |   
>   

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-28 Thread Hai Lu


> On Sept. 27, 2016, 11:11 p.m., Navina Ramesh wrote:
> > samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala, 
> > line 70
> > 
> >
> > what is the "default-partitioner"? Is it possible to have more than one 
> > partitioner?

Per my dicussion with Yi, we are simply allowing the possibility of more than 
one partitioner in the future by reflecting in the config name. We are not 
going to change samza-api to make it actually happen for now.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150637
---


On Sept. 20, 2016, 11:22 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 20, 2016, 11:22 p.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>| | |
>|  +-+---+  |   |Directory 
> Partitioner| |   HDFSAvroWriter|
>|  | IFileReader |  |   |  
>| | |
>|  | |  |   
> +--+--+ +--+--+
>|  +-+---+  |  |   
> |   
>||  |  |   
>

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-27 Thread Navina Ramesh

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150648
---


Fix it, then Ship it!




lgtm +1 .. I think you were planning to add documentation with a separate 
JIRA/RB . Correct?


build.gradle (line 308)


why is this dependency needed here? It seems like this compile dependency 
is required for samza-hdfs and now samza-hdfs depends on samza-yarn. Is there a 
better way to do this?



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 (line 53)


It is important to document the assumption that we consider the HDFS file 
set to be immutable and how we handle inconsistencies. Looks like you validate 
and throw exception in validateAndGetOriginalFilteredFiles. 
Sorry about nagging regarding documentation. This feels like a complicated 
class, where we may easily forget our design assumptions. Better to clarify it 
in the doc.



samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
 (line 42)


Not related to you change. But can you clean up some unused imports before 
you commit this file? Thanks!



samza-shell/src/main/bash/bash-run-job.sh (line 27)


If we want to use this only for the azkaban runner, we should perhaps 
rename the file as run-job-for-azkaban.sh or something on those lines.


- Navina Ramesh


On Sept. 20, 2016, 11:22 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 20, 2016, 11:22 p.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +---

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-27 Thread Navina Ramesh


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java,
> >  line 24
> > 
> >
> > One concern I had w/ this HdfsAvroFileReader/Writer is the version 
> > conflict issue. LinkedIn's Kafka version still uses avro-1.4 in the serde, 
> > while hdfs already uses avro-1.7 in 2.6.1. I guess that we need to find a 
> > solution inside LinkedIn to resolve it. Let's sync up face-to-face tomorrow.
> 
> Hai Lu wrote:
> I was well aware of the avro issue. I tried so many different APIs that I 
> finally found the set of APIs that work for both 1.4 and 1.7
> 
> Yi Pan (Data Infrastructure) wrote:
> Great! I am really curious what are the set of compatible APIs! So, I 
> guess that we just enforce avro-1.4 when compiling samza-hdfs module? I 
> remember that I tried last time and got a build failure in samza-hdfs w/ 
> AvroDataFileHdfsWriter in samza-li build. I am curious how you made it work.
> 
> Navina Ramesh wrote:
> Right now, we exclude samza-hdfs build in samza-li. 
>   "build": "ligradle -PscalaVersion=2.10 -Prelease=true 
> -PallArtifacts build -x:samza-hdfs_2.10:build",
>   
> We may want to fully understand the avro changes introduced by 
> HdfsProducer and/or HdfsConsumer in samza-li. This sounds like a blocker for 
> me right now. How are we going to overcome avro conflict introduced in 
> HdfsSystemProducer?
> 
> Hai Lu wrote:
> I know. I included it back in samza-li and it worked just fine. Just need 
> some extra dependency to make the tests pass. I have been only using li_trunk 
> to deploy to Hadoop's YARN at LinkedIn

Got it. Thanks!


- Navina


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148629
---


On Sept. 20, 2016, 11:22 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 20, 2016, 11:22 p.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Group

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-27 Thread Navina Ramesh


> On Sept. 13, 2016, 12:33 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java,
> >  line 101
> > 
> >
> > Better, to avoid the wasteful remote IO if there are multiple calls to 
> > getPartitionDescriptor from multiple threads, is to use bucketized locks on 
> > the ConcurrentHashMap entries to ensure synchronization in populating a 
> > certain hash map entry. Guava cache implemented the bucketized locking as a 
> > built-in method already: 
> > http://www.tutorialspoint.com/guava/guava_caching_utilities.htm
> 
> Navina Ramesh wrote:
> What was the resolution here? Was there any change to the IO pattern to 
> use caching?
> 
> Hai Lu wrote:
> I believe this is just to optimize the situation that multiple calls 
> happen at the same time and causing everyone making remote calls. After the 
> change here, only the first one will actually make the remote call while 
> everyone else be blocked.
> 
> It's a very very tiny improvement, to be honest.

Ah.Got it.. Actually I was looking for the change in the diff window and I 
couldn't figure out where you have used. I understood once I applied your patch 
in my IDE. Thanks!

Yes . It is tiny but important improvement.


- Navina


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148612
---


On Sept. 20, 2016, 11:22 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 20, 2016, 11:22 p.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>|   

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-27 Thread Navina Ramesh


> On Sept. 14, 2016, 6:19 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java,
> >  line 59
> > 
> >
> > Not sure what are we doing here? What's the ordering that we are 
> > enforcing in this multi-file partition? I saw that you are trying to make 
> > the offsets as an offset vector on top of all files in the same partition. 
> > Why? Can we simplify it by making it full-ordered in the same partition 
> > instead of partial-ordered via an offset vector?

I couldn't figure out the resolution for this issue. Perhaps you discussed 
offline. Can you please update the discussion here for everyone's benefit?


> On Sept. 14, 2016, 6:19 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala, 
> > line 66
> > 
> >
> > It would be nicer to make it conforming to Offspring style of config 
> > variable scoping. i.e. if the scope of configuration is for hdfs consumer, 
> > use systems.%s.consumer.hdfs.buffer-capacity. I would suggest to consult 
> > Prateek since he has been working on the Offspring config refactoring. For 
> > new config variables, "." should strictly be used as deliminator between 
> > scopes, not as deliminator between words.

Going by the logic of using period to delimit scopes, shouldn't it be 
systems.%s.consumer.hdfs-buffer-capacity? Unless there is a hdfs scope that I 
am not seeing. It is kind of weird because we assume the indirection from 
systemname (%s) to its factory will act as a scope. I am not sure what the 
correct pattern should be.


- Navina


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148780
---


On Sept. 20, 2016, 11:22 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 20, 2016, 11:22 p.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Gro

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-27 Thread Navina Ramesh

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150637
---




samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
(line 56)


Can you please add javadoc related to thread-safety of the class?



samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala (line 
70)


what is the "default-partitioner"? Is it possible to have more than one 
partitioner?



samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala 
(line 37)


Doesn't this add a dependency between samza-hdfs and samza-kafka?

It seems to have been introduced by the HdfsSystemProducer. Can we please 
fix it forward? Or as Yi suggested, please create a JIRA and add a TODO comment 
here referring to the JIRA

Thanks!


- Navina Ramesh


On Sept. 20, 2016, 11:22 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 20, 2016, 11:22 p.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>| | |
>|  +-+---+  |   |Directory 
> Partitioner| |   HDFSAvroWriter|
>|  | IFileReader |  |   |  
>| | |
>

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-27 Thread Hai Lu


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java,
> >  line 24
> > 
> >
> > One concern I had w/ this HdfsAvroFileReader/Writer is the version 
> > conflict issue. LinkedIn's Kafka version still uses avro-1.4 in the serde, 
> > while hdfs already uses avro-1.7 in 2.6.1. I guess that we need to find a 
> > solution inside LinkedIn to resolve it. Let's sync up face-to-face tomorrow.
> 
> Hai Lu wrote:
> I was well aware of the avro issue. I tried so many different APIs that I 
> finally found the set of APIs that work for both 1.4 and 1.7
> 
> Yi Pan (Data Infrastructure) wrote:
> Great! I am really curious what are the set of compatible APIs! So, I 
> guess that we just enforce avro-1.4 when compiling samza-hdfs module? I 
> remember that I tried last time and got a build failure in samza-hdfs w/ 
> AvroDataFileHdfsWriter in samza-li build. I am curious how you made it work.
> 
> Navina Ramesh wrote:
> Right now, we exclude samza-hdfs build in samza-li. 
>   "build": "ligradle -PscalaVersion=2.10 -Prelease=true 
> -PallArtifacts build -x:samza-hdfs_2.10:build",
>   
> We may want to fully understand the avro changes introduced by 
> HdfsProducer and/or HdfsConsumer in samza-li. This sounds like a blocker for 
> me right now. How are we going to overcome avro conflict introduced in 
> HdfsSystemProducer?

I know. I included it back in samza-li and it worked just fine. Just need some 
extra dependency to make the tests pass. I have been only using li_trunk to 
deploy to Hadoop's YARN at LinkedIn


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148629
---


On Sept. 20, 2016, 11:22 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 20, 2016, 11:22 p.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-27 Thread Hai Lu


> On Sept. 13, 2016, 12:33 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java,
> >  line 101
> > 
> >
> > Better, to avoid the wasteful remote IO if there are multiple calls to 
> > getPartitionDescriptor from multiple threads, is to use bucketized locks on 
> > the ConcurrentHashMap entries to ensure synchronization in populating a 
> > certain hash map entry. Guava cache implemented the bucketized locking as a 
> > built-in method already: 
> > http://www.tutorialspoint.com/guava/guava_caching_utilities.htm
> 
> Navina Ramesh wrote:
> What was the resolution here? Was there any change to the IO pattern to 
> use caching?

I believe this is just to optimize the situation that multiple calls happen at 
the same time and causing everyone making remote calls. After the change here, 
only the first one will actually make the remote call while everyone else be 
blocked.

It's a very very tiny improvement, to be honest.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148612
---


On Sept. 20, 2016, 11:22 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 20, 2016, 11:22 p.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>| | |
>|  +-+-

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-27 Thread Navina Ramesh


> On Sept. 13, 2016, 12:33 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java,
> >  line 101
> > 
> >
> > Better, to avoid the wasteful remote IO if there are multiple calls to 
> > getPartitionDescriptor from multiple threads, is to use bucketized locks on 
> > the ConcurrentHashMap entries to ensure synchronization in populating a 
> > certain hash map entry. Guava cache implemented the bucketized locking as a 
> > built-in method already: 
> > http://www.tutorialspoint.com/guava/guava_caching_utilities.htm

What was the resolution here? Was there any change to the IO pattern to use 
caching?


- Navina


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148612
---


On Sept. 20, 2016, 11:22 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 20, 2016, 11:22 p.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>| | |
>|  +-+---+  |   |Directory 
> Partitioner| |   HDFSAvroWriter|
>|  | IFileReader |  |   |  
>| | |
>|  | |  |   
> +--+--+ +--+

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-27 Thread Navina Ramesh


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java,
> >  line 24
> > 
> >
> > One concern I had w/ this HdfsAvroFileReader/Writer is the version 
> > conflict issue. LinkedIn's Kafka version still uses avro-1.4 in the serde, 
> > while hdfs already uses avro-1.7 in 2.6.1. I guess that we need to find a 
> > solution inside LinkedIn to resolve it. Let's sync up face-to-face tomorrow.
> 
> Hai Lu wrote:
> I was well aware of the avro issue. I tried so many different APIs that I 
> finally found the set of APIs that work for both 1.4 and 1.7
> 
> Yi Pan (Data Infrastructure) wrote:
> Great! I am really curious what are the set of compatible APIs! So, I 
> guess that we just enforce avro-1.4 when compiling samza-hdfs module? I 
> remember that I tried last time and got a build failure in samza-hdfs w/ 
> AvroDataFileHdfsWriter in samza-li build. I am curious how you made it work.

Right now, we exclude samza-hdfs build in samza-li. 
  "build": "ligradle -PscalaVersion=2.10 -Prelease=true -PallArtifacts 
build -x:samza-hdfs_2.10:build",
  
We may want to fully understand the avro changes introduced by HdfsProducer 
and/or HdfsConsumer in samza-li. This sounds like a blocker for me right now. 
How are we going to overcome avro conflict introduced in HdfsSystemProducer?


- Navina


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148629
---


On Sept. 20, 2016, 11:22 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 20, 2016, 11:22 p.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
> 

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-20 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/
---

(Updated Sept. 20, 2016, 11:22 p.m.)


Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
Navina Ramesh.


Bugs: SAMZA-967
https://issues.apache.org/jira/browse/SAMZA-967


Repository: samza


Description
---

Add HDFS System Consumer: 

1. System admin, partitioner
2. System consumer with metrics

Design doc can be found here: 
https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf

An overview of the high level architecture: 

The system factory is used by Samza to instantiate SystemConsumer, 
SystemProducer, and SystemAdmin for a specific system. The 
FileDataSystemFactory can be reused for other file system like sources. 

HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
HDFS files need to be consumed for this job. The DirectoryPartitioner also uses 
“GroupingPattern” to group files into partitions if advanced partitioning is 
required. HDFSSystemAdmin will then persist the “PartitionDescriptor” to HDFS.

The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
Based on this information as well as the actual assignment of partitions, it 
would then know which files to read from.

The initial implementation of the HDFS system consumer supports only avro data 
files. It’s very easy to extend it to a variety of file format by implementing 
the FileReader interface.


  
 
+--+
 
 |  
| 
   +-+ HDFS 
| 
   |   Obtain|  
| 
   |  Partition  
+--+--^--+-^---+
 
   | Description|  |  | 
| 
   ||  |  | 
| 
   |  +-v---+  |  |   
Filtering/| 
   |  | |  |  +---+
Grouping +-+   
   |  | HDFSAvroFileReader  |  |  | 
  |   
   |  | |Persist   |  | 
  |   
   |  +-+---+   Partition  |  | 
  |   
   ||  Description |   
+--v--+ +--+--+
   ||  |   |
 | | |
   |  +-+---+  |   |Directory 
Partitioner| |   HDFSAvroWriter|
   |  | IFileReader |  |   |
 | | |
   |  | |  |   
+--+--+ +--+--+
   |  +-+---+  |  | 
  |   
   ||  |  | 
  |   
   ||  |  | 
  |   
   |  +-+---+
+-+--++   +--+--+
   |  | || 
|   | |
   |  | HDFSSystemConsumer  ||   HDFSSystemAdmin   
|   | HDFSSystemProducer  |
   +--> || 
|   | |
  +-+---+
+---+-+   +--+--+
||  
  |   

+++ 
  

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-20 Thread Hai Lu


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java,
> >  line 83
> > 
> >
> > Question: this seems to be highly related to how the HDFS files are 
> > organized. It is hard to see how a common practice would look like, 
> > especially in open source. Can we make the groupIdentifier pluggable?
> 
> Hai Lu wrote:
> Why is it HDFS specific? At the very least, it can apply to any file 
> system like systems. The idea of grouping (or advanced partitioning) is to 
> allow multiple highly related files or, say AWS S3Objects, to be processed by 
> the same task.
> 
> Anyway, this is sort of pluggable already. If you don't specify 
> "group.pattern" then the group identifier will be the entire file name (i.e. 
> each group will simply be each single file themselves).
> 
> Yi Pan (Data Infrastructure) wrote:
> If the intended implementation is for a generic Partitioner for some 
> non-partitioned data source, we would need to add to the samza-api as a 
> general Partitioner interface, and then add the DirectoryPartitioner as an 
> implementation of the Partitioner interface in yarn package. Ideally, we 
> would need to make the Partitioner class also confgiurable, s.t. user can 
> implement their own customized Partitioner. As a first step, I would agree 
> that we don't expose to the user as a public API and use DirectoryPartitioner 
> as a default implementation. But it would be nice to have the configuration 
> following the scope:
> systems.%s.partitioner..class, 
> systems.%s.partitioner..group-pattern. Let's discuss in 
> person tomorrow.

Per our discussion, will do the config part and skip the samza-api change for 
now.


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java,
> >  line 152
> > 
> >
> > Ideally, shouldn't this class also include a avroFilePath variable to 
> > ensure that we never compare checkpoints for two different files?

I don't want this class to know anything about multiple files at all. Just like 
in kafkaSystemAdmin we simply compare two LONGs, we don't do 
system/stream/partition check. So similarly I will do the file path check 
upstream, which is going to be the hdfsSystemAdmin.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148629
---


On Sept. 9, 2016, 1:34 a.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 9, 2016, 1:34 a.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>  

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-20 Thread Hai Lu


> On Sept. 14, 2016, 6:19 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java,
> >  line 183
> > 
> >
> > It would be better to ensure that the index used in each partition is 
> > referring to the sorted list from inputFiles, to guarantee the consistent 
> > ordering/indexing of the same file in the list of input files. Is it 
> > possible that a new file is inserted in the middle of the sorted list? 
> > Maybe sort by create time to ensure the files are always appended to the 
> > greatest index in the sorted list?

This seems to be addressed if we perform validation on the partition 
descriptors that we persisted.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148780
---


On Sept. 9, 2016, 1:34 a.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 9, 2016, 1:34 a.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>| | |
>|  +-+---+  |   |Directory 
> Partitioner| |   HDFSAvroWriter|
>|  | IFileReader |  |   |  
>| | |
>|  | |  |   
> +--+---

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-14 Thread Hai Lu


> On Sept. 13, 2016, 12:33 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java,
> >  line 142
> > 
> >
> > Isn't it clearer to have one loop like below instead of two embedded 
> > loops:
> > while (!isShutdown) {
> >   if (!reader.hasNext()) {
> > break;
> >   }
> >   IncomingMessageEnvelope messageEnvelope = reader.readNext();
> >   try {
> >  super.put()
> >  ...
> >   } catch () {
> >  ...
> >   }
> > }

No. In your case, if the super.put() fails, your code will skip the current 
event and read the next one. Unless you throw a runtime exception in the catch 
block to completely stop the consumption.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148612
---


On Sept. 9, 2016, 1:34 a.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 9, 2016, 1:34 a.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>| | |
>|  +-+---+  |   |Directory 
> Partitioner| |   HDFSAvroWriter|
>|  | IFileReader |  |   |  
>| | |
>|  | |

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-14 Thread Yi Pan (Data Infrastructure)


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java,
> >  line 24
> > 
> >
> > One concern I had w/ this HdfsAvroFileReader/Writer is the version 
> > conflict issue. LinkedIn's Kafka version still uses avro-1.4 in the serde, 
> > while hdfs already uses avro-1.7 in 2.6.1. I guess that we need to find a 
> > solution inside LinkedIn to resolve it. Let's sync up face-to-face tomorrow.
> 
> Hai Lu wrote:
> I was well aware of the avro issue. I tried so many different APIs that I 
> finally found the set of APIs that work for both 1.4 and 1.7

Great! I am really curious what are the set of compatible APIs! So, I guess 
that we just enforce avro-1.4 when compiling samza-hdfs module? I remember that 
I tried last time and got a build failure in samza-hdfs w/ 
AvroDataFileHdfsWriter in samza-li build. I am curious how you made it work.


- Yi


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148629
---


On Sept. 9, 2016, 1:34 a.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 9, 2016, 1:34 a.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>||  |   |  
>| | |
>|  +-+---+  |   |Directory 
> Partitioner| |

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-14 Thread Yi Pan (Data Infrastructure)


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java,
> >  line 83
> > 
> >
> > Question: this seems to be highly related to how the HDFS files are 
> > organized. It is hard to see how a common practice would look like, 
> > especially in open source. Can we make the groupIdentifier pluggable?
> 
> Hai Lu wrote:
> Why is it HDFS specific? At the very least, it can apply to any file 
> system like systems. The idea of grouping (or advanced partitioning) is to 
> allow multiple highly related files or, say AWS S3Objects, to be processed by 
> the same task.
> 
> Anyway, this is sort of pluggable already. If you don't specify 
> "group.pattern" then the group identifier will be the entire file name (i.e. 
> each group will simply be each single file themselves).

If the intended implementation is for a generic Partitioner for some 
non-partitioned data source, we would need to add to the samza-api as a general 
Partitioner interface, and then add the DirectoryPartitioner as an 
implementation of the Partitioner interface in yarn package. Ideally, we would 
need to make the Partitioner class also confgiurable, s.t. user can implement 
their own customized Partitioner. As a first step, I would agree that we don't 
expose to the user as a public API and use DirectoryPartitioner as a default 
implementation. But it would be nice to have the configuration following the 
scope:
systems.%s.partitioner..class, 
systems.%s.partitioner..group-pattern. Let's discuss in 
person tomorrow.


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java,
> >  line 162
> > 
> >
> > We had an on-going issue that the partitionId to the ssp mapping does 
> > not seem to be consistent between the job restarts. I suspect that might be 
> > a problem here as well, if the groupedPartitions list is not sorted in a 
> > consistent order?
> 
> Hai Lu wrote:
> Wait, that would be a huge issue of Samza... I don't understand how is 
> the mapping between partition id and ssp not consistent? The ssp contains the 
> partition id itself, right?

Sorry, should be more crystal clear on this one. The inconsistency exists in 
the mapping between the ssps to the taskNames (which in samza, when using 
GroupByPartitionGrouper, is in the format of Partition_1). SAMZA-1012 is the 
bug reporting that. It does not affect the group membership (i.e. ssp A.1 and 
ssp.B.1 are still grouped together), but may affect the taskName for the group. 
The issue I want to poke here is: can it happen here as well? Say files AX.1 
and AX.2 are always in the same group, but due to different ordering in the 
groupedPartitions, it is assigned different partitionIds in two different runs?


- Yi


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148629
---


On Sept. 9, 2016, 1:34 a.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 9, 2016, 1:34 a.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variet

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-13 Thread Yi Pan (Data Infrastructure)


> On Sept. 13, 2016, 12:33 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java, 
> > line 91
> > 
> >
> > You can do:
> > try(FSDataOutputStream fos = fs.create(targetPath)) {
> >   fos.write(PartitionDescriptionUtil.toJson();
> >   }
> 
> Hai Lu wrote:
> Do you just intend to narrow down the try block? the "getFileSystem" 
> method above will also throw IOExecption, so I have to include everything 
> here.

I was referring to the pattern of try-with-resource in JDK7. See: 
https://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html


> On Sept. 13, 2016, 12:33 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java, 
> > line 105
> > 
> >
> > What if the PartitionDescriptor already exists? Could it be the case 
> > that the systemStreamMetadata maintains a different copy of 
> > PartitionDescriptor? It is not clear to me which one is the source of 
> > truth? directoryPartitioner.getPartitionMetadataMap()? Or 
> > directoryPartitioner.getPartitionDescriptor()? Maybe I miss some basic 
> > information regarding to the concept on PartitionDescriptor vs 
> > PartitionMetadataMap?
> 
> Hai Lu wrote:
> You are right. This is an unsolved problem given that we assume the HDFS 
> folder is immutable. So now, what if the HDFS folder really is altered. 
> Before we support the mutable HFDS input,  I think we have two options: 1. 
> Throw an runtime exception to stop the job if we see inconsistency 2. Ignore 
> the newer folder info we got and always treat the partition descriptors that 
> we persisted on HDFS as the source of truth. Use the source of truth to 
> proceed with the job and don't kill it or throw exception.
> Do you have a preference?

Actually, before we support mutable HDFS input, I think we need to do the 
following:
- if folder info is not consistent w/ partition descriptors, validate partition 
descriptors to make sure all files exist (i.e. the descriptor is still valid) 
and then continue w/ the existing partition descriptor (assuming it is 
immutable)
- if validation failed, throw an exception to stop the job, since the original 
immutable set of partition descriptor is no longer valid


> On Sept. 13, 2016, 12:33 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java,
> >  line 69
> > 
> >
> > I would prefer to follow the same pattern as KafkaSystemConsumer, i.e. 
> > passing in the HdfsSystemConsumerMetrics object instead of the 
> > MetricsRegistry. Please check the code in KafkaSystemFactory.getConsumer() 
> > to see how the metrics object are created and passed along.
> 
> Hai Lu wrote:
> I can't avoid directly using MetricsRegistry. At the very least, I have 
> to pass this to the base class: BlockingEnvelopeMap (unless you want to 
> change all the interface in BlockingEnvelopMap as well. Event if we wan t to 
> do that. It seems beyond the scope of this RB. I can maybe to a separte fix 
> for that.). But I will create a HdfsSystemConsumerMetrics.

Please check code in KafkaSystemConsumer. It extends BlockingEnvelopeMap as 
well and use metrics.registry to get the MetricsRegistry to pass into 
BlockingEnvelopeMap's constructor. No need to change any interface.


- Yi


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148612
---


On Sept. 9, 2016, 1:34 a.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 9, 2016, 1:34 a.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-13 Thread Hai Lu


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java,
> >  line 58
> > 
> >
> > nit: since the input whiteList/blakcList are also regex, shouldn't we 
> > just name them the same?

Separating white list and black list simplifies the regex a lot. I see this 
convention in databus, Kafka (http://kafka.apache.org/documentation.html) and 
many other systems.


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java,
> >  line 77
> > 
> >
> > Make sure that you check-in to open source after we disable JDK7 build. 
> > This won't work for JDK7 build in open source.

Will do.


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java,
> >  line 83
> > 
> >
> > Question: this seems to be highly related to how the HDFS files are 
> > organized. It is hard to see how a common practice would look like, 
> > especially in open source. Can we make the groupIdentifier pluggable?

Why is it HDFS specific? At the very least, it can apply to any file system 
like systems. The idea of grouping (or advanced partitioning) is to allow 
multiple highly related files or, say AWS S3Objects, to be processed by the 
same task.

Anyway, this is sort of pluggable already. If you don't specify "group.pattern" 
then the group identifier will be the entire file name (i.e. each group will 
simply be each single file themselves).


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java,
> >  line 162
> > 
> >
> > We had an on-going issue that the partitionId to the ssp mapping does 
> > not seem to be consistent between the job restarts. I suspect that might be 
> > a problem here as well, if the groupedPartitions list is not sorted in a 
> > consistent order?

Wait, that would be a huge issue of Samza... I don't understand how is the 
mapping between partition id and ssp not consistent? The ssp contains the 
partition id itself, right?


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java,
> >  line 174
> > 
> >
> > Now I see what this PartitionDescriptor really mean... Is it much more 
> > straightforward if renamed to partitionToFilePathsMap?

I started to realize that the name of partition descriptor isn't informative 
enough. I have updated the design doc to list it in the glossary section. My 
problem with "partitionToFilePath(s)" is just that, well, it's not a noun. 
"Descriptor" is more concise.


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java,
> >  line 24
> > 
> >
> > One concern I had w/ this HdfsAvroFileReader/Writer is the version 
> > conflict issue. LinkedIn's Kafka version still uses avro-1.4 in the serde, 
> > while hdfs already uses avro-1.7 in 2.6.1. I guess that we need to find a 
> > solution inside LinkedIn to resolve it. Let's sync up face-to-face tomorrow.

I was well aware of the avro issue. I tried so many different APIs that I 
finally found the set of APIs that work for both 1.4 and 1.7


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148629
---


On Sept. 9, 2016, 1:34 a.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 9, 2016, 1:34 a.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the hi

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-13 Thread Yi Pan (Data Infrastructure)

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148780
---



Thanks for pulling it off! Two high-level comments: a) I would prefer to stick 
w/ the same pattern for metric initialization in Samza (i.e. use 
MetricsRegistry to create a metric object extending MetricsHelper and pass it 
into the constructor for SystemConsumer); b) I don't fully understand the use 
case that we need to use partial-ordered offset vector in the same partition as 
implemented in MultiFileHdfsReader. Let's discuss in person.


samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 (line 50)


It would be better to wrap this into the logic inside PartitionDescriptor 
class. Hence, no need to expose and leak out the delimiter used inside the 
PartitionDescriptor. Similar comments on offsets.



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 (line 54)


It would be nicer to guarantee some sementic ordering among the files in 
the same partition, when persisting the partition descriptor.



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 (line 59)


Not sure what are we doing here? What's the ordering that we are enforcing 
in this multi-file partition? I saw that you are trying to make the offsets as 
an offset vector on top of all files in the same partition. Why? Can we 
simplify it by making it full-ordered in the same partition instead of 
partial-ordered via an offset vector?



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 (line 75)


Not sure why do we want to use the offset vector???



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 (line 90)


Again, why??? It is hard to understand without a concrete use case that 
requires a partial ordering of offsets in the same partition?



samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala (line 
66)


It would be nicer to make it conforming to Offspring style of config 
variable scoping. i.e. if the scope of configuration is for hdfs consumer, use 
systems.%s.consumer.hdfs.buffer-capacity. I would suggest to consult Prateek 
since he has been working on the Offspring config refactoring. For new config 
variables, "." should strictly be used as deliminator between scopes, not as 
deliminator between words.



samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala (line 
78)


Same here. systems.%s.partitioner.group-pattern.

Also, does this partitioner apply to other type of systems as well? Or just 
for HDFS? Won't it be better to name it systems.%s.hdfs-partitioner.* ?



samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala (line 
86)


This is the redundant definition of a config variable from YarnConfig 
object. What's the intended relationship between HdfsConfig and YarnConfig? 
Besides, yarn.job.staging.directory is the metadata config path for the whole 
job. I assume that HdfsConfig is the metadata config path for a specific 
SystemAdmin? I would recommend that we have a per system config here. It may be 
default to YarnConfig.getJobStagingDirectory().



samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala 
(line 32)


Why can't we create the HdfsSystemConsumerMetrics here and just passing in 
the HdfsSystemConsumerMetrics as we do w/ HdfsSystemProducer? Which use case 
makes it necessary to pass in the MetricsRegistry directly?



samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala 
(line 36)


Not related to your RB, but could you open a JIRA for this one? Using 
KafkaUtil class in HdfsSystemFactory seems really weird.



samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 (line 183)


It would be better to ensure that the index used in each partition is 
referring to the sorted list from inputFiles, to guarantee the consistent 
ordering/indexing of the same file in the list of input files. Is it possible 
that a new file is inserted in the middle of the sorted list? Maybe sort by 
create time to ensure the files are always appended to the greatest index in 
the s

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-13 Thread Hai Lu


> On Sept. 13, 2016, 12:33 a.m., Yi Pan (Data Infrastructure) wrote:
> > Still in the middle but don't want to lose what I had up to now.

Also in the middle of addressing all the feedbacks. Have all the changes 
locally. Will push them altogether later. Thanks again for your review!


> On Sept. 13, 2016, 12:33 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java, 
> > line 47
> > 
> >
> > nit: when you refer to the class names, it would be better to use 
> > {@link HdfsSystemAdmin} {@link HdfsSystemConsumer} etc.

Will do.


> On Sept. 13, 2016, 12:33 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java, 
> > line 91
> > 
> >
> > You can do:
> > try(FSDataOutputStream fos = fs.create(targetPath)) {
> >   fos.write(PartitionDescriptionUtil.toJson();
> >   }

Do you just intend to narrow down the try block? the "getFileSystem" method 
above will also throw IOExecption, so I have to include everything here.


> On Sept. 13, 2016, 12:33 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java, 
> > line 105
> > 
> >
> > What if the PartitionDescriptor already exists? Could it be the case 
> > that the systemStreamMetadata maintains a different copy of 
> > PartitionDescriptor? It is not clear to me which one is the source of 
> > truth? directoryPartitioner.getPartitionMetadataMap()? Or 
> > directoryPartitioner.getPartitionDescriptor()? Maybe I miss some basic 
> > information regarding to the concept on PartitionDescriptor vs 
> > PartitionMetadataMap?

You are right. This is an unsolved problem given that we assume the HDFS folder 
is immutable. So now, what if the HDFS folder really is altered. Before we 
support the mutable HFDS input,  I think we have two options: 1. Throw an 
runtime exception to stop the job if we see inconsistency 2. Ignore the newer 
folder info we got and always treat the partition descriptors that we persisted 
on HDFS as the source of truth. Use the source of truth to proceed with the job 
and don't kill it or throw exception.
Do you have a preference?


> On Sept. 13, 2016, 12:33 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java,
> >  line 69
> > 
> >
> > I would prefer to follow the same pattern as KafkaSystemConsumer, i.e. 
> > passing in the HdfsSystemConsumerMetrics object instead of the 
> > MetricsRegistry. Please check the code in KafkaSystemFactory.getConsumer() 
> > to see how the metrics object are created and passed along.

I can't avoid directly using MetricsRegistry. At the very least, I have to pass 
this to the base class: BlockingEnvelopeMap (unless you want to change all the 
interface in BlockingEnvelopMap as well. Event if we wan t to do that. It seems 
beyond the scope of this RB. I can maybe to a separte fix for that.). But I 
will create a HdfsSystemConsumerMetrics.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148612
---


On Sept. 9, 2016, 1:34 a.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 9, 2016, 1:34 a.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will 

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-12 Thread Yi Pan (Data Infrastructure)

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148629
---



I am at HdfsReaderFactory. Will continue tomorrow.


samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 (line 56)


Is this the same map as in HdfsSystemConsumer.partitionDescriptionMap?? Can 
we make them the same name? And also add the annotation you put here to 
HdfsSystemConsumer as well?



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 (line 58)


nit: since the input whiteList/blakcList are also regex, shouldn't we just 
name them the same?



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 (line 77)


Make sure that you check-in to open source after we disable JDK7 build. 
This won't work for JDK7 build in open source.



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 (line 83)


Question: this seems to be highly related to how the HDFS files are 
organized. It is hard to see how a common practice would look like, especially 
in open source. Can we make the groupIdentifier pluggable?



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 (line 92)


nit: pateern ==> pattern



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 (line 162)


We had an on-going issue that the partitionId to the ssp mapping does not 
seem to be consistent between the job restarts. I suspect that might be a 
problem here as well, if the groupedPartitions list is not sorted in a 
consistent order?



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 (line 174)


Now I see what this PartitionDescriptor really mean... Is it much more 
straightforward if renamed to partitionToFilePathsMap?



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
 (line 24)


One concern I had w/ this HdfsAvroFileReader/Writer is the version conflict 
issue. LinkedIn's Kafka version still uses avro-1.4 in the serde, while hdfs 
already uses avro-1.7 in 2.6.1. I guess that we need to find a solution inside 
LinkedIn to resolve it. Let's sync up face-to-face tomorrow.



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
 (line 56)


Is this offset the same offset that we persist into the checkpoint topic in 
Samza? If not, it would be really good to rename it to fileOffset explicitly to 
clearly differentiate the "offset" here and the "offset per ssp" in checkpoint 
topic.



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
 (line 104)


nit: please add some comments on why the key is null here.



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
 (line 152)


Ideally, shouldn't this class also include a avroFilePath variable to 
ensure that we never compare checkpoints for two different files?



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
 (line 159)


This is confusing. In the DirectoryPartitioner class, the offset can 
potentially be in the format of "0,0,0,0" for a single partition (with 
partition groups). How does it work here?

If we assume that "offset" here only refers to fileOffset, please clarify 
and discard this comment.


- Yi Pan (Data Infrastructure)


On Sept. 9, 2016, 1:34 a.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 9, 2016, 1:34 a.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachmen

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-12 Thread Yi Pan (Data Infrastructure)

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148612
---



Still in the middle but don't want to lose what I had up to now.


samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
(line 47)


nit: when you refer to the class names, it would be better to use {@link 
HdfsSystemAdmin} {@link HdfsSystemConsumer} etc.



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
(line 84)


PartitionDescriptor or PartitionDescription? I saw both used in the 
high-level designs and the code. It would better to choose one. It seems that 
PartitionDescriptor is what you intended?



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
(line 86)


So, the PartitionDescriptor is immutable? Better to make a note here or in 
javadoc of this variable.



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
(line 91)


You can do:
try(FSDataOutputStream fos = fs.create(targetPath)) {
  fos.write(PartitionDescriptionUtil.toJson();
  }



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
(line 105)


What if the PartitionDescriptor already exists? Could it be the case that 
the systemStreamMetadata maintains a different copy of PartitionDescriptor? It 
is not clear to me which one is the source of truth? 
directoryPartitioner.getPartitionMetadataMap()? Or 
directoryPartitioner.getPartitionDescriptor()? Maybe I miss some basic 
information regarding to the concept on PartitionDescriptor vs 
PartitionMetadataMap?



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
(line 130)


It would be nice to put an example offset string here to illustrate what we 
are comparing



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
(line 50)


Would be nice to add some Javadoc for this class.



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
(line 59)


I would recommend to add some javadoc here to describe what's in the 
partitionDescriptionMap. This is one of the key concept in the design and would 
be nice to have Javadoc together w/ the code.



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
(line 60)


So, I assume that you would have one reader per partition? It would be nice 
to add into the javadoc here as well. Also, why the readers are only keyed by 
Partition, not the stream names?



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
(line 61)


nit: add javadoc here to explain what's the usage of isShutdown.



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
(line 64)


What's the usage of this one, for metrics? Can we use metrics directly? 
Check KafkaSystemConsumerMetrics for a reference implementation.



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
(line 69)


I would prefer to follow the same pattern as KafkaSystemConsumer, i.e. 
passing in the HdfsSystemConsumerMetrics object instead of the MetricsRegistry. 
Please check the code in KafkaSystemFactory.getConsumer() to see how the 
metrics object are created and passed along.



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
(line 75)


If you have reason not to use ConcurrentHashMap for readers and 
partitionDescriptionMap, please state so here. After the multi-thread change, 
we must make sure all new SystemConsumer and SystemProducer are thread-safe.



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
(line 79)


These should all be encapsulated in HdfsSystemConsumerMetrics object.



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
(line 101)


Better, to avoid the wasteful remote IO if there are multiple calls to 
getPartitionDescriptor from multiple threads, is to use bucketized locks on the 
ConcurrentHashMap entries to ensure synchronization in populating a certa

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-08 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/
---

(Updated Sept. 9, 2016, 1:34 a.m.)


Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
Navina Ramesh.


Bugs: SAMZA-967
https://issues.apache.org/jira/browse/SAMZA-967


Repository: samza


Description (updated)
---

Add HDFS System Consumer: 

1. System admin, partitioner
2. System consumer with metrics

Design doc can be found here: 
https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf

An overview of the high level architecture: 

The system factory is used by Samza to instantiate SystemConsumer, 
SystemProducer, and SystemAdmin for a specific system. The 
FileDataSystemFactory can be reused for other file system like sources. 

HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
HDFS files need to be consumed for this job. The DirectoryPartitioner also uses 
“GroupingPattern” to group files into partitions if advanced partitioning is 
required. HDFSSystemAdmin will then persist the “PartitionDescriptor” to HDFS.

The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
Based on this information as well as the actual assignment of partitions, it 
would then know which files to read from.

The initial implementation of the HDFS system consumer supports only avro data 
files. It’s very easy to extend it to a variety of file format by implementing 
the FileReader interface.


  
 
+--+
 
 |  
| 
   +-+ HDFS 
| 
   |   Obtain|  
| 
   |  Partition  
+--+--^--+-^---+
 
   | Description|  |  | 
| 
   ||  |  | 
| 
   |  +-v---+  |  |   
Filtering/| 
   |  | |  |  +---+
Grouping +-+   
   |  | HDFSAvroFileReader  |  |  | 
  |   
   |  | |Persist   |  | 
  |   
   |  +-+---+   Partition  |  | 
  |   
   ||  Description |   
+--v--+ +--+--+
   ||  |   |
 | | |
   |  +-+---+  |   |Directory 
Partitioner| |   HDFSAvroWriter|
   |  | IFileReader |  |   |
 | | |
   |  | |  |   
+--+--+ +--+--+
   |  +-+---+  |  | 
  |   
   ||  |  | 
  |   
   ||  |  | 
  |   
   |  +-+---+
+-+--++   +--+--+
   |  | || 
|   | |
   |  | HDFSSystemConsumer  ||   HDFSSystemAdmin   
|   | HDFSSystemProducer  |
   +--> || 
|   | |
  +-+---+
+---+-+   +--+--+
||  
  |   

+++ 
  

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-08 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/
---

(Updated Sept. 9, 2016, 1:32 a.m.)


Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
Navina Ramesh.


Bugs: SAMZA-967
https://issues.apache.org/jira/browse/SAMZA-967


Repository: samza


Description (updated)
---

Add HDFS System Consumer: 

1. System admin, partitioner
2. System consumer with metrics

Design doc can be found here: 
https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf

An overview of the high level architecture: 

  
 
+--+
 
 |  
| 
   +-+ HDFS 
| 
   |   Obtain|  
| 
   |  Partition  
+--+--^--+-^---+
 
   | Description|  |  | 
| 
   ||  |  | 
| 
   |  +-v---+  |  |   
Filtering/| 
   |  | |  |  +---+
Grouping +-+   
   |  | HDFSAvroFileReader  |  |  | 
  |   
   |  | |Persist   |  | 
  |   
   |  +-+---+   Partition  |  | 
  |   
   ||  Description |   
+--v--+ +--+--+
   ||  |   |
 | | |
   |  +-+---+  |   |Directory 
Partitioner| |   HDFSAvroWriter|
   |  | IFileReader |  |   |
 | | |
   |  | |  |   
+--+--+ +--+--+
   |  +-+---+  |  | 
  |   
   ||  |  | 
  |   
   ||  |  | 
  |   
   |  +-+---+
+-+--++   +--+--+
   |  | || 
|   | |
   |  | HDFSSystemConsumer  ||   HDFSSystemAdmin   
|   | HDFSSystemProducer  |
   +--> || 
|   | |
  +-+---+
+---+-+   +--+--+
||  
  |   

+++ 
  
 |  
  
 
+---+--+
 
 |  
| 
 |  HDFSSystemFactory   
| 
 |  
| 
 
+--+


Diffs
-

  build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
  gradle/dependency-versions.gradle 47c71bfde027835682889407261d4798b629d214 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/H

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-08 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/
---

(Updated Sept. 9, 2016, 1:30 a.m.)


Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
Navina Ramesh.


Bugs: SAMZA-967
https://issues.apache.org/jira/browse/SAMZA-967


Repository: samza


Description (updated)
---

Add HDFS System Consumer: 

1. System admin, partitioner
2. System consumer with metrics

Design doc can be found here: 
https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf

An overview of the high level architecture: 

  
 

 
 ?  
? 
   ??? HDFS 
? 
   ?   Obtain?  
? 
   ?  Partition  

 
   ? Description?  ?  ? 
? 
   ??  ?  ? 
? 
   ?  ???  ?  ?   
Filtering/? 
   ?  ? ?  ?  ?
Grouping ???   
   ?  ? HDFSAvroFileReader  ?  ?  ? 
  ?   
   ?  ? ?Persist   ?  ? 
  ?   
   ?  ???   Partition  ?  ? 
  ?   
   ??  Description ?   
??? ???
   ??  ?   ?
 ? ? ?
   ?  ???  ?   ?Directory 
Partitioner? ?   HDFSAvroWriter?
   ?  ? IFileReader ?  ?   ?
 ? ? ?
   ?  ? ?  ?   
??? ???
   ?  ???  ?  ? 
  ?   
   ??  ?  ? 
  ?   
   ??  ?  ? 
  ?   
   ?  ???
???   ???
   ?  ? ?? 
?   ? ?
   ?  ? HDFSSystemConsumer  ??   HDFSSystemAdmin   
?   ? HDFSSystemProducer  ?
    ?? 
?   ? ?
  ???
???   ???
??  
  ?   

??? 
  
 ?  
  
 

 
 ?  
? 
 ?  HDFSSystemFactory   
? 
 ?  
? 
 



Diffs
-

  build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
  gradle/dependency-versions.gradle 47c71bfde027835682889407261d4798b629d214 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/H

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-07 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/
---

(Updated Sept. 7, 2016, 11:49 p.m.)


Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
Navina Ramesh.


Bugs: SAMZA-967
https://issues.apache.org/jira/browse/SAMZA-967


Repository: samza


Description (updated)
---

Add HDFS System Consumer: 

1. System admin, partitioner
2. System consumer with metrics

Design doc can be found here: 
https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf


Diffs
-

  build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
  gradle/dependency-versions.gradle 47c71bfde027835682889407261d4798b629d214 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
PRE-CREATION 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptionUtil.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala 
61b7570afae3219b618c8830905035063941bdd7 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
92eb4472533db67dca01f075cb460581b4bdac0d 
  
samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala 
ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestHdfsSystemConsumer.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptionUtil.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestHdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestAvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestMultiFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/test/resources/integTest/emptyTestFile PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile01 PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile02 PRE-CREATION 
  samza-hdfs/src/test/resources/reader/TestEvent.avsc PRE-CREATION 
  
samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
 261310d03de204718621f601117f016da14841df 
  samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 
4e328a5f8c2b496a71e36c106339b7af263c96c7 

Diff: https://reviews.apache.org/r/51142/diff/


Testing (updated)
---

unit tests pass.

manually tested by writing a real hdfs samza job and deploying to a yarn 
cluster.


Thanks,

Hai Lu



Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-07 Thread Hai Lu


> On Sept. 1, 2016, 9:21 p.m., Navina Ramesh wrote:
> > @lhaiesp: Your patch looks awesome. Happy to review again once you have 
> > addressed the comments. It will be great if you can add some unit test for 
> > HdfsSystemConsumer. Some of the documentation that you will have to include 
> > for this feature will be:
> > * Add newly introduced configs to configuration-table.html
> > * Add newly introduced metrics to metrics-table.html (Pending SAMZA-702 
> > commit)
> > * Add a webpage for describing the behavior of HDFS systemconsumer (or more 
> > generically, consuming from Bounded input sources) and how to use the HDFS 
> > consumer
> > 
> > You can choose to keep the documentation as a part of this RB or create a 
> > follow-up JIRA for documentation and assign it to your self. Ideally, we 
> > don't want to have a lot of gap between code and documentation. 
> > 
> > Thanks for such a thorough work!

I do have a work item for documentation. I will take a look at the existing 
documentations.


> On Sept. 1, 2016, 9:21 p.m., Navina Ramesh wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java,
> >  line 149
> > 
> >
> > Isn't numTotalEventsCounter a sum of all counters in numEventsCounter 
> > in the map ? Do we want to maintain a running sum?

It is. I think I was thinking about adding a config to enable/disable per 
partition metrics eventually. In that case a total metrics would be necessary.


> On Sept. 1, 2016, 9:21 p.m., Navina Ramesh wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java,
> >  line 171
> > 
> >
> > Question: Is the generateOldestOffset simply returning a string of "0" 
> > delimited by a comma? The number of "0" matches the number of files in the 
> > group?

Yes.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review147596
---


On Sept. 7, 2016, 11:44 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 7, 2016, 11:44 p.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> 
> Diffs
> -
> 
>   build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
>   gradle/dependency-versions.gradle 47c71bfde027835682889407261d4798b629d214 
>   samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
> PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
> PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptionUtil.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
>  PRE-CREATION 
>   samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala 
> 61b7570afae3219b618c8830905035063941bdd7 
>   
> samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
> 92eb4472533db67dca01f075cb460581b4bdac0d 
>   
> samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala
>  ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
>   
> samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestHdfsSystemConsumer.java
>  PRE-CREATION 
>   
> samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptionUtil.java
>  PRE-CREATION 
>   
> samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
>  PRE-CREATION 
>   
> samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestHdfsFileSystemAdapter.java
>  PRE-CREATION 
>   
> samza-hdfs/src/test/java/org/apache/samza/s

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-07 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/
---

(Updated Sept. 7, 2016, 11:44 p.m.)


Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
Navina Ramesh.


Bugs: SAMZA-967
https://issues.apache.org/jira/browse/SAMZA-967


Repository: samza


Description
---

Add HDFS System Consumer: 

1. System admin, partitioner
2. System consumer with metrics


Diffs (updated)
-

  build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
  gradle/dependency-versions.gradle 47c71bfde027835682889407261d4798b629d214 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
PRE-CREATION 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptionUtil.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala 
61b7570afae3219b618c8830905035063941bdd7 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
92eb4472533db67dca01f075cb460581b4bdac0d 
  
samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala 
ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestHdfsSystemConsumer.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptionUtil.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestHdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestAvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestMultiFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/test/resources/integTest/emptyTestFile PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile01 PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile02 PRE-CREATION 
  samza-hdfs/src/test/resources/reader/TestEvent.avsc PRE-CREATION 
  
samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
 261310d03de204718621f601117f016da14841df 
  samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 
4e328a5f8c2b496a71e36c106339b7af263c96c7 

Diff: https://reviews.apache.org/r/51142/diff/


Testing
---

unit tests pass.

tested by writing a real hdfs samza job and deploying to hadoop cluster.


Thanks,

Hai Lu



Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-06 Thread Hai Lu


> On Aug. 31, 2016, 7:25 a.m., Navina Ramesh wrote:
> > gradle/dependency-versions.gradle, line 39
> > 
> >
> > Why is this dependency introduced? Is it possible to get rid of this 
> > dependency ?

Yes. We can remove it here. I only added it because I found that the absence of 
this can cause unit test failure in the li_trunk branch. I guess we can do the 
fix there, instead.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review147414
---


On Aug. 29, 2016, 5:27 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Aug. 29, 2016, 5:27 p.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> 
> Diffs
> -
> 
>   build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
>   gradle/dependency-versions.gradle 47c71bfde027835682889407261d4798b629d214 
>   samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
> PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
> PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptionUtil.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
> 92eb4472533db67dca01f075cb460581b4bdac0d 
>   
> samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala
>  ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
>   
> samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptionUtil.java
>  PRE-CREATION 
>   
> samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
>  PRE-CREATION 
>   
> samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestHdfsFileSystemAdapter.java
>  PRE-CREATION 
>   
> samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestAvroFileHdfsReader.java
>  PRE-CREATION 
>   samza-hdfs/src/test/resources/partitioner/testfile01 PRE-CREATION 
>   samza-hdfs/src/test/resources/partitioner/testfile02 PRE-CREATION 
>   
> samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
>  261310d03de204718621f601117f016da14841df 
>   samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 
> 4e328a5f8c2b496a71e36c106339b7af263c96c7 
> 
> Diff: https://reviews.apache.org/r/51142/diff/
> 
> 
> Testing
> ---
> 
> unit tests pass.
> 
> tested by writing a real hdfs samza job and deploying to hadoop cluster.
> 
> 
> Thanks,
> 
> Hai Lu
> 
>



Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-01 Thread Navina Ramesh

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review147596
---



@lhaiesp: Your patch looks awesome. Happy to review again once you have 
addressed the comments. It will be great if you can add some unit test for 
HdfsSystemConsumer. Some of the documentation that you will have to include for 
this feature will be:
* Add newly introduced configs to configuration-table.html
* Add newly introduced metrics to metrics-table.html (Pending SAMZA-702 commit)
* Add a webpage for describing the behavior of HDFS systemconsumer (or more 
generically, consuming from Bounded input sources) and how to use the HDFS 
consumer

You can choose to keep the documentation as a part of this RB or create a 
follow-up JIRA for documentation and assign it to your self. Ideally, we don't 
want to have a lot of gap between code and documentation. 

Thanks for such a thorough work!


samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
(line 51)


nit: these config keys can be private or packages-specific



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
(line 54)


nit: typo in variable name "CONFOG"



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
(line 149)


Isn't numTotalEventsCounter a sum of all counters in numEventsCounter in 
the map ? Do we want to maintain a running sum?



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptionUtil.java
 (line 32)


nit: remove unused import



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptionUtil.java
 (line 38)


You can add a private default constructor to ensure that the class doesn't 
get instantiated.



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 (line 171)


Question: Is the generateOldestOffset simply returning a string of "0" 
delimited by a comma? The number of "0" matches the number of files in the 
group?



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
 (line 39)


nit: assigned and ununsed
You can get rid of the constructor here as well.



samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 (line 164)


when using groupPattern, is it required for the name of the file represent 
the file length ? (in suffix string) Or did you just add it for testing?


- Navina Ramesh


On Aug. 29, 2016, 5:27 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Aug. 29, 2016, 5:27 p.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> 
> Diffs
> -
> 
>   build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
>   gradle/dependency-versions.gradle 47c71bfde027835682889407261d4798b629d214 
>   samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
> PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
> PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptionUtil.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-08-31 Thread Navina Ramesh

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review147414
---



Still reviewing. Will continue tomorrow. Thanks!


gradle/dependency-versions.gradle (line 39)


Why is this dependency introduced? Is it possible to get rid of this 
dependency ?



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
(line 64)


nit: Unused variable



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
(line 53)


Can you add some javadocs for this config?
Ideally, we want all configs to be wrapped in a HdfsConfig object , that 
provides appropriate config accessors and default values.



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
(line 58)


I understand what staging directory is because I recently read the design 
doc. Can you please add some javadoc about what this staging directory is used 
for?



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
 (line 25)


Please add some javadocs for interfaces



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
 (line 135)


Very nice and useful javadoc!



samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
 (line 53)


If it is not implemented, I suggest removing it and simply adding a comment 
that another potential readertype could be "PLAIN" text


- Navina Ramesh


On Aug. 29, 2016, 5:27 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Aug. 29, 2016, 5:27 p.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> 
> Diffs
> -
> 
>   build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
>   gradle/dependency-versions.gradle 47c71bfde027835682889407261d4798b629d214 
>   samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
> PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
> PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptionUtil.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
> 92eb4472533db67dca01f075cb460581b4bdac0d 
>   
> samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala
>  ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
>   
> samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptionUtil.java
>  PRE-CREATION 
>   
> samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
>  PRE-CREATION 
>   
> samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestHdfsFileSystemAdapter.java
>  PRE-CREATION 
>   
> samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestAvroFileHdfsReader.java
>  PRE-CREATION 
>   samza-hdfs/src/test/resources/partitioner/testfile01 PRE-CREATION 
>   samza-hdfs/src/test/resources/partitioner/testfile02 PRE-CREATION 
>   
> samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
>  261310d03de204718621f601117f016da14841df 
>   samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 
> 4e328a5f8c2b496a71e36c106339b7af263c96c7

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-08-28 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/
---

(Updated Aug. 29, 2016, 5:52 a.m.)


Review request for samza.


Bugs: SAMZA-967
https://issues.apache.org/jira/browse/SAMZA-967


Repository: samza


Description
---

Add HDFS System Consumer: 

1. System admin, partitioner
2. System consumer with metrics


Diffs (updated)
-

  build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
  gradle/dependency-versions.gradle 47c71bfde027835682889407261d4798b629d214 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
PRE-CREATION 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptionUtil.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
92eb4472533db67dca01f075cb460581b4bdac0d 
  
samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala 
ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptionUtil.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestHdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestAvroFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile01 PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile02 PRE-CREATION 
  
samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
 261310d03de204718621f601117f016da14841df 
  samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 
4e328a5f8c2b496a71e36c106339b7af263c96c7 

Diff: https://reviews.apache.org/r/51142/diff/


Testing
---

unit tests pass.

tested by writing a real hdfs samza job and deploying to hadoop cluster.


Thanks,

Hai Lu



Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-08-23 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/
---

(Updated Aug. 23, 2016, 11:20 p.m.)


Review request for samza.


Changes
---

System consumer


Bugs: SAMZA-967
https://issues.apache.org/jira/browse/SAMZA-967


Repository: samza


Description (updated)
---

Add HDFS System Consumer: 

1. System admin, partitioner
2. System consumer with metrics


Diffs (updated)
-

  build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
  gradle/dependency-versions.gradle 47c71bfde027835682889407261d4798b629d214 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
PRE-CREATION 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptionUtil.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
92eb4472533db67dca01f075cb460581b4bdac0d 
  
samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala 
ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptionJsonUtil.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
 261310d03de204718621f601117f016da14841df 
  samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 
4e328a5f8c2b496a71e36c106339b7af263c96c7 

Diff: https://reviews.apache.org/r/51142/diff/


Testing
---

unit tests pass.

tested by writing a real hdfs samza job and deploying to hadoop cluster.


Thanks,

Hai Lu



Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-08-18 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/
---

(Updated Aug. 18, 2016, 6:05 p.m.)


Review request for samza.


Bugs: SAMZA-967
https://issues.apache.org/jira/browse/SAMZA-967


Repository: samza


Description (updated)
---

Add HDFS System Consumer: 

1. System admin, partitioner (for review)
2. System consumer with metrics (in progress)


Diffs (updated)
-

  build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
  gradle/dependency-versions.gradle 47c71bfde027835682889407261d4798b629d214 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
PRE-CREATION 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptionJsonUtil.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
 PRE-CREATION 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
92eb4472533db67dca01f075cb460581b4bdac0d 
  
samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala 
ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestHdfsSystemAdmin.java 
PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptionJsonUtil.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
 261310d03de204718621f601117f016da14841df 
  samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 
4e328a5f8c2b496a71e36c106339b7af263c96c7 

Diff: https://reviews.apache.org/r/51142/diff/


Testing
---

unit tests pass.

tested by writing a real hdfs samza job and deploying to hadoop cluster.


Thanks,

Hai Lu



Review Request 51142: SAMZA-967: HDFS System Consumer

2016-08-16 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/
---

Review request for samza.


Bugs: SAMZA-967
https://issues.apache.org/jira/browse/SAMZA-967


Repository: samza


Description
---

Add HDFS System Consumer:

Check-in is divided into two parts:

1. System admin, partiioner
2. System consumer with metrics


Diffs
-

  build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
  gradle/dependency-versions.gradle 47c71bfde027835682889407261d4798b629d214 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
PRE-CREATION 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptionJsonUtil.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
 PRE-CREATION 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
92eb4472533db67dca01f075cb460581b4bdac0d 
  
samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala 
ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptionJsonUtil.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
 261310d03de204718621f601117f016da14841df 
  samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 
4e328a5f8c2b496a71e36c106339b7af263c96c7 

Diff: https://reviews.apache.org/r/51142/diff/


Testing
---

unit tests pass.

tested by writing a real hdfs samza job and deploying to hadoop cluster.


Thanks,

Hai Lu