[jira] Subscription: PIG patch available

2016-02-29 Thread jira
Issue Subscription
Filter: PIG patch available (32 issues)

Subscriber: pigdaily

Key         Summary
PIG-4818    Single quote inside comment in GENERATE is not being ignored
            https://issues.apache.org/jira/browse/PIG-4818
PIG-4816    Read a null scalar causing a Tez failure
            https://issues.apache.org/jira/browse/PIG-4816
PIG-4796    Authenticate with Kerberos using a keytab file
            https://issues.apache.org/jira/browse/PIG-4796
PIG-4788    the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine
            https://issues.apache.org/jira/browse/PIG-4788
PIG-4745    DataBag should protect content of passed list of tuples
            https://issues.apache.org/jira/browse/PIG-4745
PIG-4734    TOMAP schema inferring breaks some scripts in type checking for bincond
            https://issues.apache.org/jira/browse/PIG-4734
PIG-4684    Exception should be changed to warning when job diagnostics cannot be fetched
            https://issues.apache.org/jira/browse/PIG-4684
PIG-4656    Improve String serialization and comparator performance in BinInterSedes
            https://issues.apache.org/jira/browse/PIG-4656
PIG-4641    Print the instance of Object without using toString()
            https://issues.apache.org/jira/browse/PIG-4641
PIG-4598    Allow user defined plan optimizer rules
            https://issues.apache.org/jira/browse/PIG-4598
PIG-4581    thread safe issue in NodeIdGenerator
            https://issues.apache.org/jira/browse/PIG-4581
PIG-4551    Partition filter is not pushed down in case of SPLIT
            https://issues.apache.org/jira/browse/PIG-4551
PIG-4539    New PigUnit
            https://issues.apache.org/jira/browse/PIG-4539
PIG-4526    Make setting up the build environment easier
            https://issues.apache.org/jira/browse/PIG-4526
PIG-4515    org.apache.pig.builtin.Distinct throws ClassCastException
            https://issues.apache.org/jira/browse/PIG-4515
PIG-4455    Should use DependencyOrderWalker instead of DepthFirstWalker in MRPrinter
            https://issues.apache.org/jira/browse/PIG-4455
PIG-4341    Add CMX support to pig.tmpfilecompression.codec
            https://issues.apache.org/jira/browse/PIG-4341
PIG-4323    PackageConverter hanging in Spark
            https://issues.apache.org/jira/browse/PIG-4323
PIG-4313    StackOverflowError in LIMIT operation on Spark
            https://issues.apache.org/jira/browse/PIG-4313
PIG-4251    Pig on Storm
            https://issues.apache.org/jira/browse/PIG-4251
PIG-4111    Make Pig compiles with avro-1.7.7
            https://issues.apache.org/jira/browse/PIG-4111
PIG-4002    Disable combiner when map-side aggregation is used
            https://issues.apache.org/jira/browse/PIG-4002
PIG-3952    PigStorage accepts '-tagSplit' to return full split information
            https://issues.apache.org/jira/browse/PIG-3952
PIG-3911    Define unique fields with @OutputSchema
            https://issues.apache.org/jira/browse/PIG-3911
PIG-3906    ant site errors out
            https://issues.apache.org/jira/browse/PIG-3906
PIG-3877    Getting Geo Latitude/Longitude from Address Lines
            https://issues.apache.org/jira/browse/PIG-3877
PIG-3873    Geo distance calculation using Haversine
            https://issues.apache.org/jira/browse/PIG-3873
PIG-3866    Create ThreadLocal classloader per PigContext
            https://issues.apache.org/jira/browse/PIG-3866
PIG-3864    ToDate(userstring, format, timezone) computes DateTime with strange handling of Daylight Saving Time with location based timezones
            https://issues.apache.org/jira/browse/PIG-3864
PIG-3851    Upgrade jline to 2.11
            https://issues.apache.org/jira/browse/PIG-3851
PIG-3668    COR built-in function when atleast one of the coefficient values is NaN
            https://issues.apache.org/jira/browse/PIG-3668
PIG-3587    add functionality for rolling over dates
            https://issues.apache.org/jira/browse/PIG-3587

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328&filterId=12322384


Re: Review Request 43571: PIG-4788:the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine

2016-02-29 Thread kelly zhang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43571/
---

(Updated March 1, 2016, 4:51 a.m.)


Review request for pig, Xianda Ke, Mohit Sabharwal, Pallavi Rao, and Xuefu 
Zhang.


Changes
---

The patch modifies PigSplit so that it extends FileSplit instead of 
InputSplit. I explain the detailed reasoning on the JIRA 
page (https://issues.apache.org/jira/browse/PIG-4788). I hope more Pig developers 
will join the discussion.


Bugs: PIG-4788
https://issues.apache.org/jira/browse/PIG-4788


Repository: pig-git


Description (updated)
---

I explained the modification in more detail on the JIRA page.
The changes in PIG-4788.patch are:
1. PigSplit extends FileSplit instead of InputSplit.
2. Add try/catch to PigSplit#getLocations() and PigSplit#getLength().
3. Add PigSplit#getPath(), which will be called in 
NewTrackingRecordReader.
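
The effect of item 1 can be sketched with stand-in classes. This is only an illustration under stated assumptions: `InputSplitLike`, `FileSplitLike`, `WrapperSplit`, `FileWrapperSplit`, and `bytesReadFor` are hypothetical names, not Hadoop, Pig, or Spark classes, and the sketch assumes that the metrics code reports a length only for splits that are instances of the file split type.

```java
// Hypothetical stand-ins; none of these are real Hadoop, Pig, or Spark classes.
abstract class InputSplitLike {
    abstract long getLength();
}

class FileSplitLike extends InputSplitLike {
    private final long length;
    FileSplitLike(long length) { this.length = length; }
    @Override long getLength() { return length; }
}

// Before the change: the wrapper extends the generic split type.
class WrapperSplit extends InputSplitLike {
    private final InputSplitLike wrapped;
    WrapperSplit(InputSplitLike wrapped) { this.wrapped = wrapped; }
    @Override long getLength() { return wrapped.getLength(); }
}

// After the change: the wrapper extends the file split type.
class FileWrapperSplit extends FileSplitLike {
    FileWrapperSplit(InputSplitLike wrapped) { super(wrapped.getLength()); }
}

final class SplitMetricsSketch {
    // Assumed behavior: bytes are only counted for file-based splits.
    static long bytesReadFor(InputSplitLike split) {
        return (split instanceof FileSplitLike) ? split.getLength() : 0L;
    }
}
```

Under that assumption, the wrapper that extends the generic type always reports 0, while the wrapper that extends the file split type reports the wrapped length. The sketch does not cover wrapped splits that are not file based.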


Diffs
-

  
src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigSplit.java 
92c6bd6 
  src/org/apache/pig/builtin/mock/Storage.java afc1d29 

Diff: https://reviews.apache.org/r/43571/diff/


Testing
---

After testing, no new unit test failures are introduced, and the unit test 
failures in TestOrcStoragePushdown will be fixed.


Thanks,

kelly zhang



[jira] [Commented] (PIG-4788) the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine

2016-02-29 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173203#comment-15173203
 ] 

liyunzhang_intel commented on PIG-4788:
---

[~mohitsabharwal]: it says that:
{quote}
A LoadFunc loads data into Pig. It can read from an HDFS file or other source.  
{quote}

I do not know whether the other source is a file source; maybe we need to ask 
others, such as [~rohini].

> the value BytesRead metric info always returns 0 even the length of input 
> file is not 0 in spark engine
> ---
>
> Key: PIG-4788
> URL: https://issues.apache.org/jira/browse/PIG-4788
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: liyunzhang_intel
> Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4788.patch
>
>
> In 
> [JobMetricsLinstener#onTaskEnd|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L140],
>  taskMetrics.inputMetrics().get().bytesRead() always returns 0 even the 
> length of input file is not zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4788) the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine

2016-02-29 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173194#comment-15173194
 ] 

Mohit Sabharwal commented on PIG-4788:
--

A custom LoadFunc, for example, can be written to return any InputFormat, no? 
https://pig.apache.org/docs/r0.12.0/api/org/apache/pig/LoadFunc.html



[jira] [Commented] (PIG-4788) the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine

2016-02-29 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173191#comment-15173191
 ] 

liyunzhang_intel commented on PIG-4788:
---

[~mohitsabharwal]: most Pig inputs are file formats. Can you point out a 
non-file input? 
No unit test failures are introduced after PigSplit extends FileSplit (I cannot 
guarantee that the unit tests cover everything). I also hope this change does 
not affect other features.



[jira] [Commented] (PIG-4788) the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine

2016-02-29 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173189#comment-15173189
 ] 

Mohit Sabharwal commented on PIG-4788:
--

Ah, of course, sorry - FileSplit can't be replaced by PigSplit.

My other concern was whether changing PigSplit to extend FileSplit will break 
PigSplit for input formats that use non-file splits. Does that make sense?



[jira] [Commented] (PIG-4788) the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine

2016-02-29 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173175#comment-15173175
 ] 

liyunzhang_intel commented on PIG-4788:
---

[~mohitsabharwal]: PigSplit is an important class that many other classes 
depend on.
So I guess we cannot use FileSplit to replace PigSplit in PigInputFormatSpark, 
although I have not tried that experiment.
Earlier, Pallavi suggested copying PigSplit to a PigSplitSpark that extends 
FileSplit and using PigSplitSpark in PigInputFormatSpark. When I tried that, I 
found it difficult because I needed to copy a lot of code.



[jira] [Commented] (PIG-4788) the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine

2016-02-29 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173164#comment-15173164
 ] 

Mohit Sabharwal commented on PIG-4788:
--

[~kellyzly], if you change {{PigSplit}} to extend {{FileSplit}}, will 
{{PigInputFormat}} still work with non-file splits like CombineFileSplit, etc.?

Can we instead use {{FileSplit}}, rather than {{PigSplit}}, when we create the 
record reader in {{PigInputFormatSpark}}? That way we could isolate the change 
in Spark-specific code.



[jira] [Commented] (PIG-4796) Authenticate with Kerberos using a keytab file

2016-02-29 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172798#comment-15172798
 ] 

Daniel Dai commented on PIG-4796:
-

Looks good to me, and thanks for the docs. [~rohini], do you have any comments, 
since you are more familiar with this part?

> Authenticate with Kerberos using a keytab file
> --
>
> Key: PIG-4796
> URL: https://issues.apache.org/jira/browse/PIG-4796
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.15.0
> Reporter: Niels Basjes
> Assignee: Niels Basjes
> Labels: feature, kerberos, security
> Attachments: 2016-02-18-1510-PIG-4796.patch, 
> 2016-02-18-PIG-4796-rough-proof-of-concept.patch, PIG-4796-2016-02-23.patch
>
>
> When running in a Kerberos secured environment users are faced with the 
> limitation that their jobs cannot run longer than the (remaining) ticket 
> lifetime of their Kerberos tickets. In the environment I work in, these 
> tickets expire after 10 hours, thus limiting the maximum job duration to at 
> most 10 hours (which is a problem).
> In the Hadoop tooling there is a feature where you can authenticate using a 
> Kerberos keytab file (essentially a file that contains the encrypted form of 
> the kerberos principal and password). Using this the running application can 
> request new tickets from the Kerberos server when the initial tickets expire.
> In my Java/Hadoop applications I commonly include these two lines:
> {code}
> System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");
> UserGroupInformation.loginUserFromKeytab("nbas...@xx.net", 
> "/home/nbasjes/.krb/nbasjes.keytab");
> {code}
> This way I have run an Apache Flink based application for more than 170 hours 
> (about a week) on the kerberos secured Yarn cluster.
> What I propose is to have a feature that I can set the relevant kerberos 
> values in my pig script and from there be able to run a pig job for many days 
> on the secured cluster.
> Proposal how this can look in a pig script:
> {code}
> SET java.security.krb5.conf '/etc/krb5.conf'
> SET job.security.krb5.principal 'nbas...@xx.net'
> SET job.security.krb5.keytab '/home/nbasjes/.krb/nbasjes.keytab'
> {code}
> So iff all of these are set (or at least the last two) then the 
> aforementioned  UserGroupInformation.loginUserFromKeytab method is called 
> before submitting the job to the cluster.
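
The proposed rule (log in from the keytab only when the principal and keytab properties are both set) can be sketched as follows. This is a minimal illustration, not the patch itself: `KeytabLoginSketch`, `KeytabLogin`, and `loginIfConfigured` are hypothetical names, and the callback stands in for `UserGroupInformation.loginUserFromKeytab` so that the sketch runs without Hadoop on the classpath. The three property names are taken from the proposal above.

```java
import java.util.Properties;

final class KeytabLoginSketch {
    // Hypothetical callback standing in for UserGroupInformation.loginUserFromKeytab.
    interface KeytabLogin {
        void login(String principal, String keytabPath);
    }

    static final String KRB5_CONF = "java.security.krb5.conf";
    static final String PRINCIPAL = "job.security.krb5.principal";
    static final String KEYTAB = "job.security.krb5.keytab";

    // Returns true if a keytab login was attempted.
    static boolean loginIfConfigured(Properties props, KeytabLogin login) {
        String principal = props.getProperty(PRINCIPAL);
        String keytab = props.getProperty(KEYTAB);
        if (principal == null || keytab == null) {
            // No keytab login; the caller keeps whatever credentials it already has.
            return false;
        }
        String krb5Conf = props.getProperty(KRB5_CONF);
        if (krb5Conf != null) {
            // Optional: point the JVM at the Kerberos configuration file.
            System.setProperty(KRB5_CONF, krb5Conf);
        }
        login.login(principal, keytab);
        return true;
    }
}
```

Passing the login as a callback keeps the decision logic testable without a Kerberos server; in a real integration the callback would delegate to the Hadoop call quoted in the issue.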





[jira] [Updated] (PIG-4817) Bump HTTP Logparser to version 2.4

2016-02-29 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4817:

   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 0.16.0
   Status: Resolved  (was: Patch Available)

Patch committed to trunk. Thanks Niels!

> Bump HTTP Logparser to version 2.4
> --
>
> Key: PIG-4817
> URL: https://issues.apache.org/jira/browse/PIG-4817
> Project: Pig
> Issue Type: Improvement
> Reporter: Niels Basjes
> Assignee: Niels Basjes
> Fix For: 0.16.0
>
> Attachments: PIG-4817-20160229.patch
>
>
> Main reason for the update is this fix:
> Now support parsing the first line even if it is chopped by Apache httpd 
> because of an URI longer than 8000 bytes.
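
To illustrate what tolerating a chopped first line can mean, here is a small sketch. It is not the logparser library's implementation: `RequestLineSketch` and `parse` are hypothetical names, and the sketch assumes that a truncated request line simply lacks the trailing protocol token.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RequestLineSketch {
    // Method, URI, and an optional protocol suffix that is absent when the line was cut off.
    private static final Pattern FIRST_LINE =
        Pattern.compile("(\\S+) (\\S+)(?: (HTTP/\\d+\\.\\d+))?");

    // Returns {method, uri, protocol}; protocol is null for a chopped line,
    // and the whole result is null if the line does not match at all.
    static String[] parse(String firstLine) {
        Matcher m = FIRST_LINE.matcher(firstLine);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }
}
```

Making the protocol group optional lets one pattern accept both the complete and the truncated form, so the caller can still extract the method and the (partial) URI.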





[jira] [Commented] (PIG-4818) Single quote inside comment in GENERATE is not being ignored

2016-02-29 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172789#comment-15172789
 ] 

Daniel Dai commented on PIG-4818:
-

+1.

And this happens only when the comment is in a GENERATE clause.

> Single quote inside comment in GENERATE is not being ignored
> 
>
> Key: PIG-4818
> URL: https://issues.apache.org/jira/browse/PIG-4818
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.12.0, 0.12.1, 0.13.0, 0.14.0, 0.15.0
> Reporter: Koji Noguchi
> Assignee: Koji Noguchi
> Priority: Minor
> Attachments: pig-4818-v01.patch
>
>
> {code}
> A = load '1.txt' as (a1:int, a2:int);
> B = FOREACH A GENERATE a1,
>  -- testing ' here with single quote
>   a2;
> dump B;
> {code}
> This fails with 
> {panel}
> 2016-02-29 20:09:05,507 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 1000: Error during parsing. Lexical error at line 6, column 0.  Encountered: 
>  after : ""
> {panel}
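
The expected behavior (a quote inside a `--` comment must not open a string literal) can be sketched with a tiny hand-written scanner. This is not Pig's JavaCC grammar and not the attached patch: `CommentScanSketch` and `stripLineComments` are hypothetical names, and block comments are deliberately not handled.

```java
final class CommentScanSketch {
    // Removes "--" line comments, but leaves "--" inside single-quoted strings alone.
    static String stripLineComments(String script) {
        StringBuilder out = new StringBuilder();
        boolean inString = false;
        int i = 0;
        while (i < script.length()) {
            char c = script.charAt(i);
            if (inString) {
                out.append(c);
                if (c == '\\' && i + 1 < script.length()) {
                    out.append(script.charAt(i + 1)); // keep the escaped character as-is
                    i++;
                } else if (c == '\'') {
                    inString = false; // closing quote
                }
            } else if (c == '\'') {
                inString = true; // opening quote
                out.append(c);
            } else if (c == '-' && i + 1 < script.length() && script.charAt(i + 1) == '-') {
                // Skip to end of line; quotes inside the comment are ignored.
                while (i < script.length() && script.charAt(i) != '\n') {
                    i++;
                }
                continue; // the newline itself is copied by the next iteration
            } else {
                out.append(c);
            }
            i++;
        }
        return out.toString();
    }
}
```

The key point is the order of the checks: the comment test only runs outside a string, and once a comment starts, nothing in it (including a single quote) can change the scanner state.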





Request: Feedback on patches

2016-02-29 Thread Niels Basjes
Hi,

A friendly request for your feedback on these patches I provided over the last
few days.

Thanks.

Authenticate with Kerberos using a keytab file
https://issues.apache.org/jira/browse/PIG-4796
This is a feature to allow running jobs on secure clusters for a duration
longer than the maximum lifetime of the kerberos tickets.



Bump HTTP Logparser to version 2.4
https://issues.apache.org/jira/browse/PIG-4817
Simply an update to the most recent version, which allows parsing a few more
edge cases.
Specifically, corrupted log lines as seen when HTTP 414 (URI too long) occurs,
which I see on a daily basis in our logfiles.



Make setting up the build environment easier
https://issues.apache.org/jira/browse/PIG-4526
A set of scripts to create a reproducible Docker-based build environment.
This is intended to provide a 'very reproducible' way of setting up the build
environment to reproduce bugs and to run builds.



ant site errors out
https://issues.apache.org/jira/browse/PIG-3906
I submitted a rather brutal fix: Disable the problematic feature. Question:
Is this fix a desirable one?



-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


[jira] [Updated] (PIG-4818) Single quote inside comment in GENERATE is not being ignored

2016-02-29 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-4818:
--
Attachment: pig-4818-v01.patch

It seems like a minor change in PIG-2507 caused this unexpected behavior.  I do 
not understand JavaCC well, but I am attaching some changes that seem to work.  
I would appreciate it if someone could review the change.





[jira] [Updated] (PIG-4818) Single quote inside comment in GENERATE is not being ignored

2016-02-29 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-4818:
--
Status: Patch Available  (was: Open)

> Single quote inside comment in GENERATE is not being ignored
> 
>
> Key: PIG-4818
> URL: https://issues.apache.org/jira/browse/PIG-4818
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.15.0, 0.14.0, 0.13.0, 0.12.1, 0.12.0
> Reporter: Koji Noguchi
> Assignee: Koji Noguchi
> Priority: Minor
> Attachments: pig-4818-v01.patch
>
>
> {code}
> A = load '1.txt' as (a1:int, a2:int);
> B = FOREACH A GENERATE a1,
>  -- testing ' here with single quote
>   a2;
> dump B;
> {code}
> This fails with 
> {panel}
> 2016-02-29 20:09:05,507 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 1000: Error during parsing. Lexical error at line 6, column 0.  Encountered: 
>  after : ""
> {panel}





[jira] [Created] (PIG-4818) Single quote inside comment in GENERATE is not being ignored

2016-02-29 Thread Koji Noguchi (JIRA)
Koji Noguchi created PIG-4818:
-

Summary: Single quote inside comment in GENERATE is not being ignored
Key: PIG-4818
URL: https://issues.apache.org/jira/browse/PIG-4818
Project: Pig
Issue Type: Bug
Affects Versions: 0.15.0, 0.14.0, 0.13.0, 0.12.1, 0.12.0
Reporter: Koji Noguchi
Assignee: Koji Noguchi
Priority: Minor


{code}
A = load '1.txt' as (a1:int, a2:int);
B = FOREACH A GENERATE a1,
 -- testing ' here with single quote
  a2;
dump B;
{code}

This fails with 
{panel}
2016-02-29 20:09:05,507 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1000: Error during parsing. Lexical error at line 6, column 0.  Encountered: 
 after : ""
{panel}





[jira] [Commented] (PIG-4796) Authenticate with Kerberos using a keytab file

2016-02-29 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171950#comment-15171950
 ] 

Niels Basjes commented on PIG-4796:
---

I just did a full-size test run using the following script on 10 days' worth of 
click data. 
Summary: Test passed on the Kerberos secured cluster I have here.


My input was 90 distinct logfiles totaling a few hundred GiB of gzipped Apache 
access logs.
My Kerberos account has been configured to have the tickets expire after 5 
minutes and have a max renew of 10 minutes (for me this is the easiest way to 
test this feature).

I ran this pig script with the following command line:
{code}
kdestroy
./bin/pig -P nbasjes.kerberos.properties -param_file LogFormats.properties ./useragent.pig
{code}

So I made sure I was logged out of Kerberos and then I ran the script against a 
Kerberos secured cluster. 
Even though the script ran for over 27 minutes, the whole thing completed 
successfully. 
I verified the output of this script, and it was correct.

The script I ran (from the pig source directory):
{code}
REGISTER ./contrib/piggybank/java/piggybank.jar ;
REGISTER ./lib/*.jar ;

-- Load only the user agent field from the access logs.
UserAgents =
  LOAD '$LOGFILE'
  USING org.apache.pig.piggybank.storage.apachelog.LogFormatLoader(
    '$LOGFORMAT',
    'HTTP.USERAGENT:request.user-agent'
  ) AS (
    useragent:chararray
  );

-- Emit one click per log line.
UserAgentsCount =
  FOREACH  UserAgents
  GENERATE useragent AS useragent:chararray,
           1L        AS clicks:long;

CountsPerUseragents =
  GROUP UserAgentsCount
  BY    (useragent);

-- Sum the clicks per user agent.
SumsPerBrowser =
  FOREACH  CountsPerUseragents
  GENERATE SUM(UserAgentsCount.clicks) AS clicks,
           group                       AS useragent;

STORE SumsPerBrowser
  INTO  'TopUseragents'
  USING org.apache.pig.piggybank.storage.CSVExcelStorage('\t', 'NO_MULTILINE', 'UNIX');
{code}
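
For readers less familiar with Pig Latin, the aggregation in this script (one click per row, grouped by user agent, then summed) amounts to a count per key. A rough Java equivalent, with the hypothetical names `UserAgentCountSketch` and `clicksPerUserAgent`, and an in-memory list standing in for the loaded log field:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class UserAgentCountSketch {
    // Groups identical user agent strings and counts how often each occurs.
    static Map<String, Long> clicksPerUserAgent(List<String> userAgents) {
        return userAgents.stream()
            .collect(Collectors.groupingBy(ua -> ua, Collectors.counting()));
    }
}
```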

[~daijy]: Is this the type of manual test you think is correct?




[jira] [Updated] (PIG-4817) Bump HTTP Logparser to version 2.4

2016-02-29 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes updated PIG-4817:
--
Status: Patch Available  (was: Open)

I ran a few tests and this works as I expect it to run.

> Bump HTTP Logparser to version 2.4
> --
>
> Key: PIG-4817
> URL: https://issues.apache.org/jira/browse/PIG-4817
> Project: Pig
> Issue Type: Improvement
> Reporter: Niels Basjes
> Assignee: Niels Basjes
> Attachments: PIG-4817-20160229.patch
>
>
> Main reason for the update is this fix:
> Now support parsing the first line even if it is chopped by Apache httpd 
> because of an URI longer than 8000 bytes.





[jira] [Resolved] (PIG-4776) Enable unit test "TestOrcStoragePushdown" for spark

2016-02-29 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4776.
--
Resolution: Fixed

The patch is committed to Spark branch. Thanks, Liyun!

> Enable unit test "TestOrcStoragePushdown" for spark
> ---
>
> Key: PIG-4776
> URL: https://issues.apache.org/jira/browse/PIG-4776
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: liyunzhang_intel
> Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4776.patch
>
>
> In latest jenkins 
> report(https://builds.apache.org/job/Pig-spark/292/#showFailuresLink), it 
> shows that following unit tests fail:
> org.apache.pig.builtin.TestOrcStoragePushdown.testColumnPruning
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownBigDecimal
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownTimestamp
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownChar
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownByteShort
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownFloatDouble
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownIntLongString
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownBoolean
> org.apache.pig.builtin.TestOrcStoragePushdown.testPredicatePushdownVarchar
>   





[jira] [Commented] (PIG-4776) Enable unit test "TestOrcStoragePushdown" for spark

2016-02-29 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171895#comment-15171895
 ] 

Xuefu Zhang commented on PIG-4776:
--

+1



[jira] [Resolved] (PIG-4243) Fix "TestStore" for Spark engine

2016-02-29 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4243.
--
Resolution: Fixed

Committed to Spark branch. Thanks, Liyun!

> Fix "TestStore" for Spark engine
> 
>
> Key: PIG-4243
> URL: https://issues.apache.org/jira/browse/PIG-4243
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: liyunzhang_intel
> Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4243.patch, PIG-4243_1.patch, 
> TEST-org.apache.pig.test.TestStore.txt
>
>
> 1. Build spark and pig env according to PIG-4168
> 2. add TestStore to $PIG_HOME/test/spark-tests
> cat  $PIG_HOME/test/spark-tests
> **/TestStore
> 3. run unit test TestStore
> ant test-spark
> 4. the unit test fails
> error log is attached





[jira] [Updated] (PIG-4817) Bump HTTP Logparser to version 2.4

2016-02-29 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes updated PIG-4817:
--
Attachment: PIG-4817-20160229.patch



[jira] [Resolved] (PIG-4813) AvroStorage doesn't work for schema from external file for EMR

2016-02-29 Thread Jagdish Kewat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jagdish Kewat resolved PIG-4813.

Resolution: Not A Bug

This works on HDFS as well.
The key is to use the built-in "*org.apache.pig.builtin.AvroStorage*" instead of 
the piggybank "*org.apache.pig.piggybank.storage.avro.AvroStorage*".

Resolving as not a bug.

> AvroStorage doesn't work for schema from external file for EMR
> --
>
>         Key: PIG-4813
>         URL: https://issues.apache.org/jira/browse/PIG-4813
>     Project: Pig
>  Issue Type: Bug
>    Reporter: Jagdish Kewat
>
> Hi Team,
> I couldn't get schema loading for AvroStorage to work as described in 
> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-etl-avro.html.
> It works fine if I provide the raw schema string with the 'schema' option, as 
> described in https://cwiki.apache.org/confluence/display/PIG/AvroStorage.
> On HDFS I don't even need to specify the schema with the store command.
> Some quick details on the versions in use:
> * Hadoop :
> {code}
> Hadoop 2.6.0-amzn-2
> Subversion g...@aws157git.com:/pkg/Aws157BigTop -r 
> 41f4e6be3ac5d6676a3464f77de79a33e8fdd9f3
> Compiled by ec2-user on 2015-11-16T20:56Z
> Compiled with protoc 2.5.0
> {code}
> * Pig :
> {code}
> Apache Pig version 0.14.0-amzn-0 (r: unknown)
> {code}
> * piggybank jar version:
> ** piggybank-0.14.0.jar
> * avro jar version :
> ** avro-1.7.7.jar
> * avro-ipc jar version :
> ** avro-ipc-1.7.7.jar
> * json-simple jar version :
> ** json-simple-1.1.jar
> I tried looking for an EMR-specific version of the piggybank jar, but had no 
> luck. I fear I am not using the correct versions of the jars, since the 
> feature should work as documented.
> Please advise if I am missing anything.
> Thanks,
> Jagdish
>  





[jira] [Created] (PIG-4817) Bump HTTP Logparser to version 2.4

2016-02-29 Thread Niels Basjes (JIRA)
Niels Basjes created PIG-4817:
-

     Summary: Bump HTTP Logparser to version 2.4
         Key: PIG-4817
         URL: https://issues.apache.org/jira/browse/PIG-4817
     Project: Pig
  Issue Type: Improvement
    Reporter: Niels Basjes
    Assignee: Niels Basjes


The main reason for the update is this fix:
Now supports parsing the first line even if it is chopped by Apache httpd 
because of a URI longer than 8000 bytes.






[jira] [Commented] (PIG-4813) AvroStorage doesn't work for schema from external file for EMR

2016-02-29 Thread Jagdish Kewat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171582#comment-15171582
 ] 

Jagdish Kewat commented on PIG-4813:


Thanks [~daijy] !

Using org.apache.pig.builtin.AvroStorage worked. I still need to check whether 
this works on HDFS as well.

Regards,
Jagdish






[jira] [Commented] (PIG-4243) Fix "TestStore" for Spark engine

2016-02-29 Thread Pallavi Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171562#comment-15171562
 ] 

Pallavi Rao commented on PIG-4243:
--

+1 for the new patch. [~xuefuz], please commit.



