Re: Using NOT NULL in a Pig FILTER statement.

Chandeep Singh Fri, 19 Feb 2016 09:31:11 -0800

Yes, I did filter using the same conditions you’ve mentioned. I tested it 
earlier with comma as the delimiter (previous email has logs) and now with ^A.


[csingh~]$ cat -v test.txt
1^A2^A76
1^A^A^A76
^A2^A^A76
1^A1^A2^A
1^A1^A1^A76
1^A2^A1^A76

grunt> D = LOAD 'test.txt' USING PigStorage('\\u001') AS (IS_REPORTED:INT, 
PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
grunt> DUMP D;
(1,2,76,)
(1,,,76)
(,2,,76)
(1,1,2,)
(1,1,1,76)
(1,2,1,76)

grunt> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is 
not null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND 
(PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);

grunt> DUMP X;
(1,2,1,76)


So the filter for NULLs is working, as you can see from the dump after filtering.
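
If it helps with debugging on your side, here is a minimal sketch against the same test.txt and schema as above. The aliases NULL_ROWS and X2 are names I made up for illustration. It relies on my understanding of Pig's null handling: a comparison against a null yields null, and FILTER only keeps rows whose condition evaluates to true. Under that assumption the explicit "is not null" checks are redundant next to the equality tests, and dumping the null rows separately shows what the loader actually parsed.

```pig
-- Same load as in the session above; file name and schema come from this thread.
D = LOAD 'test.txt' USING PigStorage('\\u001') AS (IS_REPORTED:INT,
    PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);

-- Rows where either checked column was parsed as null.
-- NULL_ROWS is a hypothetical alias, used only for inspection.
NULL_ROWS = FILTER D BY (IS_REPORTED is null) OR (PROCESSING_STATUS_ID is null);
DUMP NULL_ROWS;

-- Equality against a null yields null, and FILTER drops rows whose condition
-- is not true, so this shorter form should select the same rows as X.
-- X2 is a hypothetical alias.
X2 = FILTER D BY (IS_REPORTED == 1) AND (PROGRAM_ID == 1) AND
    (PROCESSING_STATUS_ID == 2) AND (AFFINITY_GROUP_ID == 76);
DUMP X2;
```

If NULL_ROWS comes back empty on your real input even though the stored output shows blanks, the columns are probably not being split on the delimiter you expect, so the load statement would be the place to look rather than the filter.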

> On Feb 19, 2016, at 12:13 AM, Parth Sawant wrote:
> 
> Did you put a Filter on the values to remove the null? I'm trying to filter
> the NULL values using the Pig Filter Keyword and then use the Phoenix Pig
> integration to store the data. I have '\\u001' as the delimiter for
> multiple files. It is supported by Pig BulkLoader too.
> 
> Snippet:
> 
> D = LOAD 'src_dest' using PigStorage('\\u001') AS (IS_REPORTED:INT,
> PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
> 
> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is not
> null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
> (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
> 
> On Thu, Feb 18, 2016 at 3:06 PM, Chandeep Singh wrote:
> 
>> So, I added one record to your sample to match all the conditions you have
>> in your filter statement.
>> 
>> New input:
>> [csingh]$ hadoop fs -cat test.txt
>> 1,,2,76
>> 1,,,76
>> ,2,,76
>> 1,1,2,
>> 1,1,1,76
>> 1,2,1,76
>> 
>> I modified the load statement to use PigStorage delimited by comma.
>> 
>> D = LOAD 'test.txt' USING PigStorage(',') AS (IS_REPORTED:INT,
>> PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
>> 
>> Output:
>> (1,2,1,76)
>> 
>> So, the NOT NULL's seem to be working.
>> 
>> Pig logs:
>> 
>> grunt> D = LOAD 'test.txt' USING PigStorage(',') AS (IS_REPORTED:INT,
>> PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
>> grunt> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID
>> is not null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
>> (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
>> grunt> DUMP X;
>> 2016-02-18 23:01:06,336 [main] INFO
>> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
>> script: FILTER
>> 2016-02-18 23:01:06,366 [main] INFO
>> org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer -
>> {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune,
>> DuplicateForEachColumnRewrite, GroupByConstParallelSetter,
>> ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, MergeFilter,
>> MergeForEach, NewPartitionFilterOptimizer, PushDownForEachFlatten,
>> PushUpFilter, SplitFilter, StreamTypeCastInserter],
>> RULES_DISABLED=[FilterLogicExpressionSimplifier, PartitionFilterOptimizer]}
>> 2016-02-18 23:01:06,480 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>> - MR plan size before optimization: 1
>> 2016-02-18 23:01:10,798 [JobControl] INFO
>> org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is
>> deprecated. Instead, use fs.defaultFS
>> 2016-02-18 23:01:11,345 [JobControl] INFO
>> org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job:
>> job_1454499131434_9884
>> 2016-02-18 23:01:11,542 [JobControl] INFO
>> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted
>> application application_1454499131434_9884
>> 2016-02-18 23:01:11,597 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> - 0% complete
>> 2016-02-18 23:01:31,393 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> - 50% complete
>> 2016-02-18 23:01:36,818 [main] INFO
>> org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is
>> deprecated. Instead, use mapreduce.job.reduces
>> 2016-02-18 23:01:36,875 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> - 100% complete
>> 2016-02-18 23:01:36,878 [main] INFO
>> org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
>> 
>> HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
>> 2.6.0-cdh5.4.8  0.12.0-cdh5.4.8 csingh  2016-02-18 23:01:06     2016-02-18 23:01:36     FILTER
>> 
>> Success!
>> 
>> Job Stats (time in seconds):
>> JobId   Maps    Reduces MaxMapTime      MinMapTIme      AvgMapTime      MedianMapTime   MaxReduceTime   MinReduceTime   AvgReduceTime   MedianReducetime        Alias   Feature Outputs
>> job_1454499131434_9884  1       0       8       8       8       8       n/a     n/a     n/a     n/a     D,X     MAP_ONLY
>> 
>> Input(s):
>> Successfully read 6 records (418 bytes) from:
>> 
>> Output(s):
>> Successfully stored 1 records (10 bytes) in:
>> 
>> Counters:
>> Total records written : 1
>> Total bytes written : 10
>> Spillable Memory Manager spill count : 0
>> Total bags proactively spilled: 0
>> Total records proactively spilled: 0
>> 
>> Job DAG:
>> job_1454499131434_9884
>> 
>> 2016-02-18 23:01:36,976 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> - Success!
>> 2016-02-18 23:01:36,992 [main] INFO
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths
>> to process : 1
>> 2016-02-18 23:01:36,993 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input
>> paths to process : 1
>> (1,2,1,76)
>> 
>> 
>> 
>>> On Feb 18, 2016, at 10:13 PM, Parth Sawant wrote:
>>> 
>>> Attaching a sample input. Basically 5 rows with only 4 Integer values in
>>> each. Some are NULL values.
>>> 
>>> Thanks.
>>> 
>>> On Thu, Feb 18, 2016 at 2:03 PM, Chandeep Singh wrote:
>>> I'm just looking for one sample record (which has NULLs) and not the
>>> entire input so that it's easier for me to debug.
>>> 
>>>> On Feb 18, 2016, at 9:40 PM, Parth Sawant wrote:
>>>> 
>>>> The input is simply too large to relay to others. A simplified schema is
>>>> below. I only have INT columns with some null values in them. This is my
>>>> Pig code snippet:
>>>> 
>>>> D = LOAD 'src_locatn' AS
>>>> (IS_REPORTED:INT, PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT,
>>>> AFFINITY_GROUP_ID:INT);
>>>> 
>>>> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is not
>>>> null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
>>>> (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
>>>> 
>>>> Thanks
>>>> 
>>>> On Thu, Feb 18, 2016 at 12:59 PM, Chandeep Singh wrote:
>>>> 
>>>>> Any chance you could share a sample record which has NULLs in it, as well
>>>>> as your Pig script?
>>>>> 
>>>>>> On Feb 18, 2016, at 8:36 PM, Parth Sawant wrote:
>>>>>> 
>>>>>> I had anticipated it would throw a similar error with this suggestion as
>>>>>> the last one... and it did. My fields are declared as INT, just to
>>>>>> reiterate. I don't think they can be compared to regexes. Here is the
>>>>>> error:
>>>>>> 
>>>>>> ERROR 1037:
>>>>>> <file LeadSales.pig, line 19, column 29> Operands of Regex can be
>>>>>> CharArray only :(Name: Regex Type: null Uid: null)
>>>>>> 
>>>>>> org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1037:
>>>>>> <file LeadSales.pig, line 19, column 29> Operands of Regex can be
>>>>>> CharArray only :(Name: Regex Type: null Uid: null)
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>>> 
>>>>>> On Thu, Feb 18, 2016 at 5:24 AM, Chandeep Singh wrote:
>>>>>> 
>>>>>>> Since you have integers in this field, can you try matching to a regular
>>>>>>> expression?
>>>>>>> 
>>>>>>> Something like: X matches '\\d+'
>>>>>>> 
>>>>>>>> On Feb 18, 2016, at 12:55 AM, Parth Sawant wrote:
>>>>>>>> 
>>>>>>>> Hi Chandeep. I tried that already but it gave me the following error:
>>>>>>>> 
>>>>>>>> ERROR 1039:
>>>>>>>> <file LeadSales.pig, line 19, column 27> In alias X, incompatible
>>>>>>>> types in NotEqual Operator left hand side:int right hand
>>>>>>>> side:chararray.
>>>>>>>> 
>>>>>>>> The error makes sense because the fields I have are INT type and hence
>>>>>>>> cannot be compared to a chararray.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks for the prompt response though.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Feb 17, 2016 16:32, "Chandeep Singh" wrote:
>>>>>>>> 
>>>>>>>> Try adding != '' along with IS NOT NULL.
>>>>>>>>> 
>>>>>>>>>> On Feb 18, 2016, at 12:26 AM, Parth Sawant wrote:
>>>>>>>>>> 
>>>>>>>>>> I'm trying to filter some null fields in Pig using 'IS NOT NULL'. For
>>>>>>>>>> some reason the null data values persist.
>>>>>>>>>> For example, the output of the following filter, once stored, still
>>>>>>>>>> contains null values for ABC and PQR.
>>>>>>>>>> 
>>>>>>>>>> X = FILTER D BY (ABC IS NOT NULL) AND (ABC==1) AND (PQR==1) AND (PQR IS
>>>>>>>>>> NOT NULL) ;
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Can someone help with this?
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> 
>>>>>>>>>> Parth S
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>>> <Sample_in.txt>
