Re: Question about reading ORC file in Flink

2019-09-25 Thread Fabian Hueske
Thank you very much for coming back and reporting the good news! :-)
If you think there is something we can do to improve Flink's ORC
input format, for example logging a warning, please open a Jira.
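
For illustration only, a minimal sketch of such a check (this is not Flink's
actual code; the class and method names here are made up):

import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class OrcFieldCheck {

    private static final Logger LOG = LoggerFactory.getLogger(OrcFieldCheck.class);

    // Warn about requested fields that have no exact match in the file's schema.
    public static void warnOnUnmatchedFields(List<String> requestedFields,
                                             List<String> orcFieldNames) {
        for (String field : requestedFields) {
            if (!orcFieldNames.contains(field)) {
                // An exact, case-sensitive lookup misses e.g. "tagIndustry" vs.
                // "TagIndustry", and the field is then silently read as null.
                LOG.warn("Field '{}' has no exact match in ORC schema {}; it will be null.",
                        field, orcFieldNames);
            }
        }
    }
}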

Thank you,
Fabian


Re: Question about reading ORC file in Flink

2019-09-24 Thread 163
Hi Fabian,

After debugging in local mode, I found that the Flink ORC connector is not the
problem: some fields in our schema are in capital (upper-case) form, so these
fields cannot be matched. The program that reads the ORC file directly uses the
includeColumns method, which matches columns with equalsIgnoreCase, so it can
read those fields.
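
To illustrate the difference, here is a minimal, self-contained sketch (this is
not the actual Flink or ORC code, just the two matching strategies side by side):

import java.util.Arrays;
import java.util.List;

public class ColumnMatching {

    // Case-sensitive lookup, as an exact-match connector performs it.
    static int findExact(List<String> schemaFields, String wanted) {
        return schemaFields.indexOf(wanted); // -1 when only the case differs
    }

    // Case-insensitive lookup, in the style of includeColumns.
    static int findIgnoreCase(List<String> schemaFields, String wanted) {
        for (int i = 0; i < schemaFields.size(); i++) {
            if (schemaFields.get(i).equalsIgnoreCase(wanted)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("content", "TagIndustry");
        System.out.println(findExact(fields, "tagIndustry"));      // -1: unmatched, read as null
        System.out.println(findIgnoreCase(fields, "tagIndustry")); // 1: field resolved
    }
}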

Thanks for your help!

Qi Shu


Re: Question about reading ORC file in Flink

2019-09-24 Thread Fabian Hueske
Hi QiShu,

It might be that Flink's OrcInputFormat has a bug.
Can you open a Jira issue to report the problem?
In order to be able to fix this, we need as much information as possible.
It would be great if you could create a minimal example of an ORC file and
a program that reproduces the issue.
If that's not possible, we need the schema of an ORC file that cannot be
read correctly.
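
For example, a small standalone program using the org.apache.orc core API can
write such a file (illustrative only; the schema and values are made up):

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class WriteTestOrcFile {
    public static void main(String[] args) throws Exception {
        // Two string columns; one field name deliberately upper-cased.
        TypeDescription schema =
                TypeDescription.fromString("struct<content:string,TagIndustry:string>");
        Writer writer = OrcFile.createWriter(
                new Path("test.orc"),
                OrcFile.writerOptions(new Configuration()).setSchema(schema));
        VectorizedRowBatch batch = schema.createRowBatch();
        BytesColumnVector content = (BytesColumnVector) batch.cols[0];
        BytesColumnVector tagIndustry = (BytesColumnVector) batch.cols[1];
        int row = batch.size++;
        content.setVal(row, "some text".getBytes(StandardCharsets.UTF_8));
        tagIndustry.setVal(row, "FD001".getBytes(StandardCharsets.UTF_8));
        writer.addRowBatch(batch);
        writer.close();
    }
}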

Thanks,
Fabian


Question about reading ORC file in Flink

2019-09-23 Thread ShuQi
Hi Guys,

The Flink version is 1.9.0. I use OrcTableSource to read an ORC file in HDFS,
and the job executes successfully, without any exception or error. But some
fields (such as tagIndustry) are always null, although these fields are not
null in the file. I can read them when I read the file directly. Below is my
code:


// main
final ParameterTool params = ParameterTool.fromArgs(args);

final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setGlobalJobParameters(params);

Configuration config = new Configuration();

OrcTableSource orcTableSource = OrcTableSource
        .builder()
        .path(params.get("input"))
        .forOrcSchema(TypeDescription.fromString(TYPE_INFO))
        .withConfiguration(config)
        .build();

DataSet<Row> dataSet = orcTableSource.getDataSet(env);

DataSet<Tuple2<String, Integer>> counts =
        dataSet.flatMap(new Tokenizer()).groupBy(0).sum(1);

// read field
public void flatMap(Row row, Collector<Tuple2<String, Integer>> out) {

    String content = (String) row.getField(6);
    String tagIndustry = (String) row.getField(35);

    LOGGER.info("arity: " + row.getArity());
    LOGGER.info("content: " + content);
    LOGGER.info("tagIndustry: " + tagIndustry);
    LOGGER.info("===");

    if (Strings.isNullOrEmpty(content)
            || Strings.isNullOrEmpty(tagIndustry)
            || !tagIndustry.contains("FD001")) {
        return;
    }

    // normalize and split the line
    String[] tokens = content.toLowerCase().split("\\W+");

    // emit the pairs
    for (String token : tokens) {
        if (token.length() > 0) {
            out.collect(new Tuple2<>(token, 1));
        }
    }
}
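
The TYPE_INFO constant is not shown above. Purely as an illustration, the schema
string passed to forOrcSchema has the following shape (field names here are
hypothetical; per the follow-up above, they must match the case used in the ORC
file exactly):

// Hypothetical schema string; the real TYPE_INFO is not part of this thread.
private static final String TYPE_INFO =
        "struct<content:string,tagIndustry:string>";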


Thanks for your help!

QiShu