[jira] [Commented] (PIG-2166) UDFs to join a bag

2012-06-08 Thread Hien Luu (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292185#comment-13292185
 ] 

Hien Luu commented on PIG-2166:
---

Yes, I am able to run e2e tests.  I am hoping to finish adding tests to 
nightly.conf this weekend :)

> UDFs to join a bag
> --
>
> Key: PIG-2166
> URL: https://issues.apache.org/jira/browse/PIG-2166
> Project: Pig
>  Issue Type: Improvement
>Reporter: Daniel Dai
>Assignee: Hien Luu
>Priority: Minor
>  Labels: newbie, simple
> Attachments: PIG-2166.diff, bagtotuplestring.diff, 
> test_harnesss_1338753364
>
>
> We have received several requests for a UDF to flatten a bag. It seems reasonable to create 
> one in builtin:
> 1. BagToTuple: {(a),(b),(c)} -> (a,b,c)
> 2. BagToString(delimit="_"): {(a),(b),(c)} -> "a_b_c"
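
As a rough illustration of the two conversions listed above, here is a minimal Java sketch. It models a bag as a plain list of tuples and a tuple as a list of fields. The class and method names (BagJoinSketch, bagToTuple, bagToString) are made up for this example; they are not the UDFs from the attached patches, which would presumably operate on Pig's own bag and tuple types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical stand-in types: a "bag" is a List of tuples, a "tuple" is a List of fields.
class BagJoinSketch {

    // BagToTuple: {(a),(b),(c)} -> (a,b,c)
    static List<Object> bagToTuple(List<List<Object>> bag) {
        List<Object> flat = new ArrayList<>();
        for (List<Object> tuple : bag) {
            flat.addAll(tuple);   // copy every field of every tuple, in order
        }
        return flat;
    }

    // BagToString: {(a),(b),(c)} -> "a_b_c"
    static String bagToString(List<List<Object>> bag, String delimiter) {
        StringJoiner joiner = new StringJoiner(delimiter);
        for (Object field : bagToTuple(bag)) {
            joiner.add(String.valueOf(field));   // a null field is rendered as "null"
        }
        return joiner.toString();
    }
}
```

Building the string on top of the flattened tuple keeps the two conversions consistent with each other. How a real UDF should treat null fields or nested bags is left open here.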

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: Handle NULL values in Cube dimensions

2012-06-08 Thread Jonathan Coveney
you could always make the value pluggable, going with Unknown for now, and
then down the line if we want, we could add an "ONNULL" value to the parser
that sets it.

2012/6/8 Prasanth J 

> Thanks Alan and Dmitriy for your thoughts.
>
> I think we have two different approaches now.
>
> In one approach, if we encounter a null in dimension values we can just
> label it as "unknown" and use "NULL" string to represent rollups. Whereas,
> in other approach, if we encounter a null in dimension values, use the null
> value as such but use "*" or any other string for rollups.
>
> Both approaches look good to me. Please let me know which one I should go
> ahead with.
>
> Thanks
> -- Prasanth
>
> On Jun 8, 2012, at 12:22 PM, Alan Gates wrote:
>
> > Option 1 (throwing an error) is bad.  It violates "Pigs eat anything"
> (see http://pig.apache.org/philosophy.html).
> >
> > Do we need to give users an ability to name this unknown column?  Why
> not just label it "unknown" and be done?
> >
> > Alan.
> >
> > On Jun 6, 2012, at 2:24 PM, Prasanth J wrote:
> >
> >> Hello everyone
> >>
> >> I would like to bring up this discussion about the ways for handling
> NULL values in dimensions specified for cubing. For example, if we have a
> dimension color with following values
> >>
> >> red
> >> blue
> >> null
> >> green
> >>
> >> how do we tell whether the null value represents a rollup of all
> color values or an actual null value?
> >>
> >> SQL way:
> >> There are 2 ways in which SQL Server Analysis Services handles null
> values in dimensions
> >> 1) Throw error when it encounters null values in dimension values
> >> 2) Ignore error by adding the null values to UnknownMembers. By default
> UnknownMembers will be named as "Unknown". The name for UnknownMembers can
> also be specified by the user.
> >>
> >> Do we need to handle both ways in Pig? I think the first way (throwing
> error) is pretty straightforward.
> >> For the second way (ignoring error), what is the best way to provide
> support for user specified name for UnknownMembers?
> >>
> >> Please share your thoughts about how we can handle this scenario for
> different datatypes in Pig.
> >>
> >> Thanks
> >> -- Prasanth
> >>
> >
>
>


[jira] [Commented] (PIG-2532) Registered classes fail deserialization in frontend

2012-06-08 Thread Aniket Mokashi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292169#comment-13292169
 ] 

Aniket Mokashi commented on PIG-2532:
-

I tested on trunk; it's not fully fixed. If we register the jar from S3, it 
still fails with the same error. I will open another JIRA for that.

> Registered classes fail deserialization in frontend
> ---
>
> Key: PIG-2532
> URL: https://issues.apache.org/jira/browse/PIG-2532
> Project: Pig
>  Issue Type: Bug
>Reporter: Travis Crawford
>Assignee: Travis Crawford
> Fix For: 0.10.0, 0.9.3, 0.11
>
> Attachments: PIG-2532-h23.patch, PIG-2532-log.zip, PIG-2532-v2.patch, 
> PIG-2532-v3.patch, PIG-2532-v4-branch-0.9.patch, PIG-2532-v4.patch, 
> PIG-2532.patch, PIG-253_javax.zip
>
>
> This issue came up while integrating HCatalog with our environment. HCatalog 
> jars are added to the pig command line with {{-Dpig.additional.jars}}, but the 
> job fails (exception below). When added to the pig classpath the error goes away.
> We identified the issue as deserialization using the root class loader, not 
> the context class loader set when the thread is created. This causes 
> HCatSchema, which is serialized into the context, to fail deserialization in 
> the thread.
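
For readers unfamiliar with the class loader problem described above, one common Java technique is to subclass ObjectInputStream and resolve classes through the thread's context class loader before falling back to the default lookup. The sketch below shows that general technique only. The class name ContextClassLoaderObjectInputStream and the roundTrip helper are invented for this example, and nothing here is taken from the PIG-2532 patches.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;

// Hypothetical helper, not Pig's ObjectSerializer.
class ContextClassLoaderObjectInputStream extends ObjectInputStream {

    ContextClassLoaderObjectInputStream(InputStream in) throws IOException {
        super(in);
    }

    @Override
    protected Class<?> resolveClass(ObjectStreamClass desc)
            throws IOException, ClassNotFoundException {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader != null) {
            try {
                // Try the context class loader first, so classes from jars
                // registered at runtime can be found.
                return Class.forName(desc.getName(), false, loader);
            } catch (ClassNotFoundException e) {
                // Fall through to the default lookup (which also handles primitives).
            }
        }
        return super.resolveClass(desc);
    }

    // Serializes a value and reads it back through the stream above.
    static Object roundTrip(Serializable value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in = new ContextClassLoaderObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

This only helps if the thread's context class loader actually knows about the extra jars, which is the condition the description says holds when the thread is created.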
> {code}
> 2012-02-14 21:55:53,936 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 6017: java.io.IOException: Deserialization error: 
> org.apache.hcatalog.data.schema.HCatSchema
>   at 
> org.apache.pig.impl.util.ObjectSerializer.deserialize(ObjectSerializer.java:55)
>   at org.apache.pig.impl.util.UDFContext.deserialize(UDFContext.java:181)
>   at 
> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil.setupUDFContext(MapRedUtil.java:159)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.setupUdfEnvAndStores(PigOutputFormat.java:229)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.checkOutputSpecs(PigOutputFormat.java:186)
>   at 
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:811)
>   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:771)
>   at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
>   at 
> org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigJobControl.mainLoopAction(PigJobControl.java:144)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigJobControl.run(PigJobControl.java:121)
>   at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hcatalog.data.schema.HCatSchema
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:247)
>   at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:603)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1574)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1495)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1731)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
>   at java.util.Hashtable.readObject(Hashtable.java:859)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:974)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1848)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
>   at java.util.Ha

Re: Handle NULL values in Cube dimensions

2012-06-08 Thread Prasanth J
Thanks Alan and Dmitriy for your thoughts.

I think we have two different approaches now.

In one approach, if we encounter a null in dimension values we can just label 
it as "unknown" and use "NULL" string to represent rollups. Whereas, in other 
approach, if we encounter a null in dimension values, use the null value as 
such but use "*" or any other string for rollups. 
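
To make the two approaches concrete, here is a small Java sketch that expands one input row into the rows a cube would aggregate over. The class name CubeNullSketch and the parameters nullLabel and rollupMarker are hypothetical names chosen for this example; this is not code from the cube implementation. Passing the labels as parameters also matches the idea of keeping the value pluggable.

```java
import java.util.ArrayList;
import java.util.List;

class CubeNullSketch {

    // Expands one row of dimension values into all 2^n cube combinations.
    // nullLabel:    replacement for a genuine null in the data (pass null to keep it as null)
    // rollupMarker: value written for a dimension that has been rolled up
    static List<List<String>> expand(List<String> dims, String nullLabel, String rollupMarker) {
        int n = dims.size();
        List<List<String>> rows = new ArrayList<>();
        for (int mask = 0; mask < (1 << n); mask++) {
            List<String> row = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {
                    row.add(rollupMarker);               // dimension i is rolled up
                } else {
                    String value = dims.get(i);
                    row.add(value == null ? nullLabel : value);   // genuine null in the data
                }
            }
            rows.add(row);
        }
        return rows;
    }
}
```

With this sketch, the first approach corresponds to expand(row, "unknown", "NULL") and the second to expand(row, null, "*"). In both cases a rolled-up dimension and a genuinely null value end up with different representations, which is the distinction the thread is after.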

Both approaches look good to me. Please let me know which one I should go 
ahead with. 

Thanks
-- Prasanth

On Jun 8, 2012, at 12:22 PM, Alan Gates wrote:

> Option 1 (throwing an error) is bad.  It violates "Pigs eat anything" (see 
> http://pig.apache.org/philosophy.html).  
> 
> Do we need to give users an ability to name this unknown column?  Why not 
> just label it "unknown" and be done?
> 
> Alan.
> 
> On Jun 6, 2012, at 2:24 PM, Prasanth J wrote:
> 
>> Hello everyone
>> 
>> I would like to bring up this discussion about the ways for handling NULL 
>> values in dimensions specified for cubing. For example, if we have a 
>> dimension color with following values
>> 
>> red
>> blue
>> null
>> green
>> 
>> how do we tell whether the null value represents a rollup of all color 
>> values or an actual null value? 
>> 
>> SQL way: 
>> There are 2 ways in which SQL Server Analysis Services handles null values 
>> in dimensions 
>> 1) Throw error when it encounters null values in dimension values
>> 2) Ignore error by adding the null values to UnknownMembers. By default 
>> UnknownMembers will be named as "Unknown". The name for UnknownMembers can 
>> also be specified by the user.
>> 
>> Do we need to handle both ways in Pig? I think the first way (throwing 
>> error) is pretty straightforward.
>> For the second way (ignoring error), what is the best way to provide support 
>> for user specified name for UnknownMembers? 
>> 
>> Please share your thoughts about how we can handle this scenario for 
>> different datatypes in Pig. 
>> 
>> Thanks
>> -- Prasanth
>> 
> 



[jira] [Commented] (PIG-2166) UDFs to join a bag

2012-06-08 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292149#comment-13292149
 ] 

Daniel Dai commented on PIG-2166:
-

Hi, Hien,
Are you able to run e2e tests? Do you need any help?






[jira] [Created] (PIG-2744) Handle Pig command line with XML special characters

2012-06-08 Thread Richard Ding (JIRA)
Richard Ding created PIG-2744:
-

 Summary: Handle Pig command line with XML special characters
 Key: PIG-2744
 URL: https://issues.apache.org/jira/browse/PIG-2744
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.10.0
Reporter: Richard Ding
Assignee: Richard Ding


Pig stores the Pig command line string in the Hadoop job XML file. This fails if 
the command line string contains XML special characters. Pig should treat the 
command line string like the Pig script and encode it first. 
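
The issue does not say which encoding is meant. As one possible illustration, Base64 produces output that contains none of the XML special characters and can be decoded back to the original string. The sketch below uses the standard java.util.Base64 class (Java 8 and later). The class name CommandLineEncodingSketch is made up for this example, and whether the eventual fix uses Base64 is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper showing one encoding that is safe to embed in XML.
class CommandLineEncodingSketch {

    // Encodes the command line so the result contains only A-Z, a-z, 0-9, '+', '/' and '='.
    static String encode(String commandLine) {
        return Base64.getEncoder()
                .encodeToString(commandLine.getBytes(StandardCharsets.UTF_8));
    }

    // Recovers the original command line from the encoded form.
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }
}
```

Escaping the five XML special characters would keep the value readable in the job file, while an opaque encoding like this one avoids having to reason about which characters need escaping.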





[jira] [Commented] (PIG-2651) Provide a much easier to use accumulator interface

2012-06-08 Thread Jonathan Coveney (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292104#comment-13292104
 ] 

Jonathan Coveney commented on PIG-2651:
---

Thanks Daniel!

> Provide a much easier to use accumulator interface
> --
>
> Key: PIG-2651
> URL: https://issues.apache.org/jira/browse/PIG-2651
> Project: Pig
>  Issue Type: New Feature
>Reporter: Jonathan Coveney
>Assignee: Jonathan Coveney
> Fix For: 0.11
>
> Attachments: PIG-2651-0.patch, PIG-2651-1.patch, PIG-2651-2.patch
>
>
> This introduces a new interface, IteratingAccumulatorEvalFunc (that name is 
> NOT final...). The cool thing about this patch is that it is built purely on 
> top of the existing Accumulator code (well, it uses PIG-2066, but it could 
> easily work without it). That is to say, it's an easier way to write 
> accumulators without having to fork the Pig codebase.
> The downside is that the only way I am able to provide such a clean interface 
> is by using a second thread. I need to explore any potential performance 
> implications, but given that most of the easy to use Pig stuff has 
> performance implications, I think as long as we measure and document 
> them, it's worth the much more usable interface. Plus I don't think it will 
> be too bad as one thread does the heavy lifting, while another just ferries 
> values in between. SUM could now be written as:
> {code}
> public class SUM extends IteratingAccumulatorEvalFunc<Long> {
>     public Long exec(Iterator<Tuple> it) throws IOException {
>         long sum = 0;
>         // The first field of each tuple is expected to hold a Long.
>         while (it.hasNext()) {
>             sum += (Long)it.next().get(0);
>         }
>         return sum;
>     }
> }
> {code}
> Besides performance tests, I need to figure out how to properly test this 
> sort of thing. I particularly welcome advice on that front.
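
The patch itself is not reproduced in this message, so the following is only a guess at the general shape of the "second thread" idea: one thread pushes values as they arrive, while a worker thread runs ordinary iterator-style code over them. All names (IteratorBridgeSketch, accumulate, getValue) are hypothetical, and the queue-based design is an assumption rather than a description of the real implementation.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

// Hypothetical sketch: bridges push-style input to pull-style Iterator code.
class IteratorBridgeSketch<T, R> {
    private static final Object END = new Object();   // end-of-input marker
    private final BlockingQueue<Object> queue = new LinkedBlockingQueue<>();
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final Future<R> result;

    IteratorBridgeSketch(Function<Iterator<T>, R> body) {
        // The worker thread runs the user's iterator-style code.
        Callable<R> task = () -> body.apply(new QueueIterator());
        result = worker.submit(task);
    }

    // Push side: hand one (non-null) value to the worker thread.
    void accumulate(T value) {
        queue.add(value);
    }

    // Signals the end of input and waits for the worker's answer.
    R getValue() {
        queue.add(END);
        try {
            return result.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            worker.shutdown();
        }
    }

    private final class QueueIterator implements Iterator<T> {
        private Object buffered;   // next element, or END once input is exhausted

        @Override
        public boolean hasNext() {
            if (buffered == null) {
                try {
                    buffered = queue.take();   // blocks until the push side adds something
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                }
            }
            return buffered != END;
        }

        @Override
        @SuppressWarnings("unchecked")
        public T next() {
            if (!hasNext()) throw new NoSuchElementException();
            T value = (T) buffered;
            buffered = null;
            return value;
        }
    }
}
```

In this sketch the worker thread does the real computation while the calling thread only ferries values into the queue, which is roughly the division of labour the description mentions. The cost is the hand-off through the queue for every value, which is where any performance penalty would be expected to show up.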





Pig blog?

2012-06-08 Thread Jonathan Coveney
I think it would be really neat to have a lightweight pig blog that would
allow contributors and motivated users an opportunity to show off what they
are working on. What I have in mind would be something relatively low
friction, where people could talk about their use case, features they are
working on, or longer posts to show off features that are being implemented.

Thoughts?


Build failed in Jenkins: Pig-trunk #1260

2012-06-08 Thread Apache Jenkins Server
See 

Changes:

[daijy] PIG-2651: Provide a much easier to use accumulator interface

[daijy] Disable Scripting_12 for hadoop 23

[daijy] PIG-2593: Filter by a boolean value does not work

[daijy] PIG-2665: Bundled Jython jar in Pig 0.10.0-RC breaks module import in 
Python scripts with embedded Pig Latin

[daijy] PIG-2736: Support implicit cast from bytearray to boolean

[daijy] PIG-2741: Python script throws an NameError: name 'Configuration' is 
not defined in case cache dir is not created

[daijy] PIG-2669: Pig release should include pig-default.properties after 
rebuild

--
[...truncated 6636 lines...]
 [findbugs]   org.python.core.PyCode
 [findbugs]   com.jcraft.jsch.HostKey
 [findbugs]   org.apache.hadoop.hbase.filter.Filter
 [findbugs]   org.python.core.PyBoolean
 [findbugs]   org.apache.commons.logging.Log
 [findbugs]   org.jruby.RubyFloat
 [findbugs]   com.google.common.util.concurrent.ListenableFuture
 [findbugs]   org.apache.hadoop.util.RunJar
 [findbugs]   org.jruby.RubyBoolean
 [findbugs]   org.apache.hadoop.mapred.Counters$Group
 [findbugs]   com.jcraft.jsch.ChannelExec
 [findbugs]   org.apache.hadoop.hbase.util.Base64
 [findbugs]   org.antlr.runtime.TokenStream
 [findbugs]   org.apache.hadoop.io.IOUtils
 [findbugs]   org.jruby.RubyBignum
 [findbugs]   com.google.common.util.concurrent.CheckedFuture
 [findbugs]   org.apache.hadoop.io.file.tfile.TFile$Reader$Scanner$Entry
 [findbugs]   org.apache.hadoop.fs.FSDataInputStream
 [findbugs]   org.python.core.PyObject
 [findbugs]   jline.History
 [findbugs]   org.jruby.embed.internal.LocalContextProvider
 [findbugs]   org.apache.hadoop.io.BooleanWritable
 [findbugs]   org.apache.log4j.Logger
 [findbugs]   org.apache.hadoop.hbase.filter.FamilyFilter
 [findbugs]   org.antlr.runtime.IntStream
 [findbugs]   org.apache.hadoop.util.ReflectionUtils
 [findbugs]   org.apache.hadoop.fs.ContentSummary
 [findbugs]   org.jruby.runtime.builtin.IRubyObject
 [findbugs]   org.jruby.RubyInteger
 [findbugs]   org.python.core.PyTuple
 [findbugs]   org.apache.hadoop.conf.Configuration
 [findbugs]   com.google.common.base.Joiner
 [findbugs]   org.apache.hadoop.mapreduce.lib.input.FileSplit
 [findbugs]   org.apache.hadoop.mapred.Counters$Counter
 [findbugs]   com.jcraft.jsch.Channel
 [findbugs]   org.apache.hadoop.mapred.JobPriority
 [findbugs]   org.apache.commons.cli.Options
 [findbugs]   org.apache.hadoop.mapred.JobID
 [findbugs]   org.apache.hadoop.util.bloom.BloomFilter
 [findbugs]   org.python.core.PyFrame
 [findbugs]   org.apache.hadoop.hbase.filter.CompareFilter
 [findbugs]   org.apache.hadoop.util.VersionInfo
 [findbugs]   org.python.core.PyString
 [findbugs]   org.apache.hadoop.io.Text$Comparator
 [findbugs]   org.jruby.runtime.Block
 [findbugs]   org.antlr.runtime.MismatchedSetException
 [findbugs]   org.apache.hadoop.io.BytesWritable
 [findbugs]   org.apache.hadoop.fs.FsShell
 [findbugs]   org.mozilla.javascript.ImporterTopLevel
 [findbugs]   org.apache.hadoop.hbase.mapreduce.TableOutputFormat
 [findbugs]   org.apache.hadoop.mapred.TaskReport
 [findbugs]   org.antlr.runtime.tree.RewriteRuleSubtreeStream
 [findbugs]   org.apache.commons.cli.HelpFormatter
 [findbugs]   com.google.common.collect.Maps
 [findbugs]   org.mozilla.javascript.NativeObject
 [findbugs]   org.apache.hadoop.hbase.HConstants
 [findbugs]   org.apache.hadoop.io.serializer.Deserializer
 [findbugs]   org.antlr.runtime.FailedPredicateException
 [findbugs]   org.apache.hadoop.io.compress.CompressionCodec
 [findbugs]   org.jruby.RubyNil
 [findbugs]   org.apache.hadoop.fs.FileStatus
 [findbugs]   org.apache.hadoop.hbase.client.Result
 [findbugs]   org.apache.hadoop.mapreduce.JobContext
 [findbugs]   org.codehaus.jackson.JsonGenerator
 [findbugs]   org.apache.hadoop.mapreduce.TaskAttemptContext
 [findbugs]   org.apache.hadoop.io.BytesWritable$Comparator
 [findbugs]   org.apache.hadoop.io.LongWritable$Comparator
 [findbugs]   org.codehaus.jackson.map.util.LRUMap
 [findbugs]   org.apache.hadoop.hbase.util.Bytes
 [findbugs]   org.antlr.runtime.MismatchedTokenException
 [findbugs]   org.codehaus.jackson.JsonParser
 [findbugs]   com.jcraft.jsch.UserInfo
 [findbugs]   org.python.core.PyException
 [findbugs]   org.apache.commons.cli.ParseException
 [findbugs]   org.apache.hadoop.io.compress.CompressionOutputStream
 [findbugs]   org.apache.hadoop.hbase.filter.WritableByteArrayComparable
 [findbugs]   org.antlr.runtime.tree.CommonTreeNodeStream
 [findbugs]   org.apache.log4j.Level
 [findbugs]   org.apache.hadoop.hbase.client.Scan
 [findbugs]   org.jruby.anno.JRubyMethod
 [findbugs]   org.apache.hadoop.mapreduce.Job
 [findbugs]   com.google.common.util.concurrent.Futures
 [findbugs]   org.apache.commons.logging.LogFactory
 [findbugs]   org.apache.commons.codec.binary.Base64
 [findbugs]   org.codehaus.jackson.map.ObjectMapper
 [findbugs]   org.apache.hadoop.fs.FileSystem
 [findbugs]   org.jruby.embed.LocalContextScop

[jira] [Updated] (PIG-2651) Provide a much easier to use accumulator interface

2012-06-08 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2651:


   Resolution: Fixed
Fix Version/s: (was: 0.10.1)
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

+1 for the patch.

Committed to trunk. I don't think it should be backported to the 0.10 branch, so I 
dropped 0.10.1 from the fix versions.






[jira] [Commented] (PIG-1314) Add DateTime Support to Pig

2012-06-08 Thread Russell Jurney (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291961#comment-13291961
 ] 

Russell Jurney commented on PIG-1314:
-

Avro might store DateTimes as an ISO string?

> Add DateTime Support to Pig
> ---
>
> Key: PIG-1314
> URL: https://issues.apache.org/jira/browse/PIG-1314
> Project: Pig
>  Issue Type: Bug
>  Components: data
>Affects Versions: 0.7.0
>Reporter: Russell Jurney
>Assignee: Zhijie Shen
>  Labels: gsoc2012
> Attachments: PIG-1314-1.patch, joda_vs_builtin.zip
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Hadoop/Pig are primarily used to parse log data, and most logs have a 
> timestamp component.  Therefore Pig should support dates as a primitive.
> Can someone familiar with adding types to pig comment on how hard this is?  
> We're looking at doing this, rather than use UDFs.  Is this a patch that 
> would be accepted?
> This is a candidate project for Google summer of code 2012. More information 
> about the program can be found at 
> https://cwiki.apache.org/confluence/display/PIG/GSoc2012





[jira] [Updated] (PIG-1314) Add DateTime Support to Pig

2012-06-08 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated PIG-1314:
-

Attachment: (was: PIG-1314-1.patch)






[jira] [Updated] (PIG-1314) Add DateTime Support to Pig

2012-06-08 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated PIG-1314:
-

Attachment: PIG-1314-1.patch
PIG-1314-1.patch

I've modified the code in the src package related to the primitive DateTime 
(see the attached file). As the code related to data types is spread widely across 
the project, I still need to go through it a few more times to find any 
missing parts.

So far, there are a few more issues that need to be discussed:

1. Pig can also import into and export from HBase storage, which also doesn't 
have a primitive DateTime. Throw an exception in this case as well, correct?

2. For type casting between DateTime and other types of data, how about 
following the rules below:
a. Allow: DateTime <-- Numeric value (being converted to Long first)
b. Allow: DateTime <-- String
c. Not allow: DateTime <-- Boolean
d. Only explicit casting allowed

3. DateTime is serialized as a Long value (Unix timestamp) when it is necessary.
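
The proposed rules can be sketched in plain Java as follows. This is only an illustration: it uses java.time.Instant as a stand-in for the DateTime type, assumes the Long value is a Unix timestamp in milliseconds (the message does not state the unit), and assumes strings are in ISO-8601 form. The class and method names are hypothetical and do not come from the attached patch.

```java
import java.time.Instant;

// Hypothetical sketch of the proposed casting rules, with Instant standing in for DateTime.
class DateTimeCastSketch {

    // Explicit cast to DateTime (rule d: the caller has to ask for it).
    static Instant toDateTime(Object value) {
        if (value instanceof Number) {
            // Rule a: numeric values are converted to long first (assumed: epoch milliseconds).
            return Instant.ofEpochMilli(((Number) value).longValue());
        }
        if (value instanceof String) {
            // Rule b: strings are parsed (assumed: ISO-8601, e.g. 2012-06-08T00:00:00Z).
            return Instant.parse((String) value);
        }
        // Rule c: booleans, and anything else, are rejected.
        throw new IllegalArgumentException("Cannot cast " + value + " to DateTime");
    }

    // Item 3: a DateTime is serialized as a long (assumed: epoch milliseconds).
    static long serialize(Instant dateTime) {
        return dateTime.toEpochMilli();
    }
}
```

Because a Number is narrowed with longValue(), a fractional input such as 1.9 is truncated to 1 before conversion. Whether that is the desired behaviour is one of the details the rules above leave open.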






[jira] [Updated] (PIG-2593) Filter by a boolean value does not work

2012-06-08 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2593:


   Resolution: Fixed
Fix Version/s: 0.11
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

All unit tests pass. 

Patch committed to trunk. Thanks Jie!

> Filter by a boolean value does not work
> ---
>
> Key: PIG-2593
> URL: https://issues.apache.org/jira/browse/PIG-2593
> Project: Pig
>  Issue Type: Bug
>  Components: build
>Reporter: Daniel Dai
>Assignee: Jie Li
> Fix For: 0.11
>
> Attachments: PIG-2593.patch
>
>
> The following script does not work:
> {code}
> a = load 'allscalar10k' as (name, age, gpa, instate);
> b = filter a by instate;
> explain b;
> {code}
> Exception:
> ERROR 1200:   mismatched input ';' expecting 
> IS
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during 
> parsing.   mismatched input ';' expecting IS
> at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1598)
> at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1541)
> at org.apache.pig.PigServer.registerQuery(PigServer.java:541)
> at 
> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:945)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:392)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
> at org.apache.pig.Main.run(Main.java:599)
> at org.apache.pig.Main.main(Main.java:153)
> Caused by: Failed to parse:   mismatched 
> input ';' expecting IS
> at 
> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:222)
> at 
> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:164)
> at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1590)
> ... 9 more
> It works if we change the script into:
> {code}
> a = load 'allscalar10k' as (name, age, gpa, instate);
> b = filter a by instate==TRUE;
> explain b;
> {code}





Re: Handle NULL values in Cube dimensions

2012-06-08 Thread Alan Gates
Option 1 (throwing an error) is bad.  It violates "Pigs eat anything" (see 
http://pig.apache.org/philosophy.html).  

Do we need to give users an ability to name this unknown column?  Why not just 
label it "unknown" and be done?

Alan.

On Jun 6, 2012, at 2:24 PM, Prasanth J wrote:

> Hello everyone
> 
> I would like to bring up this discussion about the ways for handling NULL 
> values in dimensions specified for cubing. For example, if we have a 
> dimension color with following values
> 
> red
> blue
> null
> green
> 
> how do we tell whether the null value represents a rollup of all color 
> values or an actual null value? 
> 
> SQL way: 
> There are 2 ways in which SQL Server Analysis Services handles null values in 
> dimensions 
> 1) Throw error when it encounters null values in dimension values
> 2) Ignore error by adding the null values to UnknownMembers. By default 
> UnknownMembers will be named as "Unknown". The name for UnknownMembers can 
> also be specified by the user.
> 
> Do we need to handle both ways in Pig? I think the first way (throwing error) 
> is pretty straightforward.
> For the second way (ignoring error), what is the best way to provide support 
> for user specified name for UnknownMembers? 
> 
> Please share your thoughts about how we can handle this scenario for 
> different datatypes in Pig. 
> 
> Thanks
> -- Prasanth
> 



[jira] [Commented] (PIG-2743) Output Schema

2012-06-08 Thread Gianmarco De Francisci Morales (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291801#comment-13291801
 ] 

Gianmarco De Francisci Morales commented on PIG-2743:
-

The alternative option would be to prepend the rank to the tuple (akin to line 
numbers).
The advantage would be you always know where your rank field will end up (i.e. 
$0).
But I have no strong opinion on it.
Does anybody else care to comment?

> Output Schema
> -
>
> Key: PIG-2743
> URL: https://issues.apache.org/jira/browse/PIG-2743
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Allan Avendaño
>Assignee: Allan Avendaño
>
> For the rank operator, I was considering the following schema:
> E.g.
> A = load 'data' as (x:int,y:chararray,z:int,rz:chararray);
> C = rank A by x;
> So the output schema could be: 
> C: {x: int,y: chararray,z: int,rz: chararray,A::rank: int}
> In general 
> {,::rank#int}
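
The two placements under discussion (rank appended as the last field versus prepended as $0) can be compared with a small sketch. This is plain Python, not the RANK operator itself. The helper name rank_by is hypothetical, and ties simply receive consecutive positions here, which is a simplifying assumption rather than the operator's specified behavior.

```python
def rank_by(rows, key_index, prepend=False):
    """Sort tuples by one field and attach a 1-based rank to each tuple."""
    ordered = sorted(rows, key=lambda row: row[key_index])
    ranked = []
    for position, row in enumerate(ordered, start=1):
        if prepend:
            ranked.append((position,) + row)   # rank always lands at $0
        else:
            ranked.append(row + (position,))   # rank lands after the last field
    return ranked


data = [(3, "c"), (1, "a"), (2, "b")]
print(rank_by(data, 0))                # [(1, 'a', 1), (2, 'b', 2), (3, 'c', 3)]
print(rank_by(data, 0, prepend=True))  # [(1, 1, 'a'), (2, 2, 'b'), (3, 3, 'c')]
```

With the appended form, the position of the rank field depends on the width of the input schema. With the prepended form it is always the first field.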





[jira] [Commented] (PIG-2742) Rank Operator Syntax

2012-06-08 Thread Gianmarco De Francisci Morales (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291799#comment-13291799
 ] 

Gianmarco De Francisci Morales commented on PIG-2742:
-

I think the front end part looks good.
Let's now move to the logical and physical implementation.

> Rank Operator Syntax
> 
>
> Key: PIG-2742
> URL: https://issues.apache.org/jira/browse/PIG-2742
> Project: Pig
>  Issue Type: Sub-task
>  Components: build
>Affects Versions: 0.10.0
>Reporter: Allan Avendaño
>Assignee: Allan Avendaño
> Attachments: PIG-2742
>
>
> The syntax proposed is the following:
> RANK  (BY (|)+)?
> This syntax already runs with the attached patch, which contains the code 
> implemented so far along with the corresponding tests.
> Looking forward to reading your comments.





[jira] [Resolved] (PIG-2665) Bundled Jython jar in Pig 0.10.0-RC breaks module import in Python scripts with embedded Pig Latin

2012-06-08 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai resolved PIG-2665.
-

  Resolution: Fixed
Hadoop Flags: Reviewed

Patch committed to trunk. Thanks to Julien for reviewing!

> Bundled Jython jar in Pig 0.10.0-RC breaks module import in Python scripts 
> with embedded Pig Latin
> --
>
> Key: PIG-2665
> URL: https://issues.apache.org/jira/browse/PIG-2665
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: Verified bug on RHEL6 and on Ubuntu 11.10 with Sun JDK 
> 1.6, and both Jython 2.5.0 (shipped with the Pig 0.10.0 RC package) and 
> Jython 2.5.2.
>Reporter: Michael Noll
>Assignee: Daniel Dai
> Fix For: 0.11
>
> Attachments: PIG-2665-1.patch, PIG-2665-2.patch
>
>
> Using Pig 0.9.0 I was running into PIG-1824 when using import statements 
> (e.g. {{import os}}) in a Python script with embedded Pig Latin.  Dmitriy 
> Ryaboy pointed me to the new Pig 0.10 release candidate 
> (http://people.apache.org/~daijy/pig-0.10.0-candidate-0/pig-0.10.0.tar.gz) so 
> that I could test whether my issue was solved by the new Pig version.  During 
> testing I ran into the error described below.
> *Summary (TL;DR)*
> * Even a minimal Python script with embedded Pig Latin will throw an error if 
> there is a single import statement in the Python code.
> * The fix is to replace the bundled {{lib/jython.jar}} with a standalone 
> version of the same jar.
> *Error message: "ERROR 1121: Python Error (ImportError: No module named 
> )"*
> {code}
> $ /path/to/pig-0.10.0-RC1/bin/pig rctest.py 
> 2012-04-24 11:20:44,224 [main] INFO  org.apache.pig.Main - Apache Pig version 
> 0.10.0 (r1328203) compiled Apr 19 2012, 22:54:12
> [...snip...]
> *sys-package-mgr*: can't create package cache dir, 
> '/path/to/pig-0.10.0-RC1/lib/cachedir/packages'
> 2012-04-24 11:20:44,816 [main] INFO  
> org.apache.pig.scripting.jython.JythonScriptEngine - created tmp 
> python.cachedir=/tmp/pig_jython_4081589571886870123
> 2012-04-24 11:20:45,033 [main] ERROR org.apache.pig.Main - ERROR 1121: Python 
> Error. Traceback (most recent call last):
>   File "/home/mnoll/pig10rc/rctest.py", line 5, in 
> import os
> ImportError: No module named os
> {code}
> In the Pig log file:
> {code}
> Error before Pig is launched
> 
> ERROR 1121: Python Error. Traceback (most recent call last):
>   File "/home/mnoll/pig10rc/rctest.py", line 5, in 
> import os
> ImportError: No module named os
> org.apache.pig.backend.executionengine.ExecException: ERROR 1121: Python 
> Error. Traceback (most recent call last):
>   File "/home/mnoll/pig10rc/rctest.py", line 5, in 
> import os
> ImportError: No module named os
> at 
> org.apache.pig.scripting.jython.JythonScriptEngine$Interpreter.execfile(JythonScriptEngine.java:210)
> at 
> org.apache.pig.scripting.jython.JythonScriptEngine.load(JythonScriptEngine.java:384)
> at 
> org.apache.pig.scripting.jython.JythonScriptEngine.main(JythonScriptEngine.java:368)
> at org.apache.pig.scripting.ScriptEngine.run(ScriptEngine.java:275)
> at org.apache.pig.Main.runEmbeddedScript(Main.java:929)
> at org.apache.pig.Main.run(Main.java:510)
> at org.apache.pig.Main.main(Main.java:111)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> Caused by: Traceback (most recent call last):
> {code}
> *How to reproduce*
> Create a simple Python script that uses embedded Pig Latin AND that imports 
> Python standard modules (any import statement will work):
> {code}
> #!/usr/bin/python 
> from org.apache.pig.scripting import Pig 
> # this import statement will trigger the error;
> # remove it and everything will work fine
> import os
> if __name__ == "__main__":
>     pig_script = """
>     set job.name 'Pig 0.10.0-RC1 Python test';
>     """
>     P = Pig.compile(pig_script)
>     bound = P.bind()
>     result = bound.runSingle()
>     if result.isSuccessful():
>         print "Pig job succeeded"
>     else:
>         raise Exception("Pig job failed")
> {code}
> Then proceed as follows.
> {code}
> #
> # Install the Pig 0.10.0 release candidate [1].
> #
> # run the Python test script
> $ /path/to/pig-0.10.0-RC1/bin/pig rctest.py 
> #
> # see section above for error message
> #
> {code}
> *Test Environment*
> Apart from the "Environment" JIRA field please note that none of th
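
A quick way to check whether a given jython.jar can satisfy "import os" is to look for the bundled standard library inside it. The sketch below assumes the standalone jar keeps its Python modules under a top-level Lib/ directory. The helper name jar_bundles_stdlib is hypothetical, and the two demo jars are throwaway files created only for illustration.

```python
import os
import tempfile
import zipfile


def jar_bundles_stdlib(jar_path, module="os"):
    """Return True if the jar holds Lib/<module>.py or its compiled class."""
    wanted = {"Lib/%s.py" % module, "Lib/%s$py.class" % module}
    with zipfile.ZipFile(jar_path) as jar:
        return any(name in wanted for name in jar.namelist())


# Build two throwaway jars: one without Lib/, one with it.
demo_dir = tempfile.mkdtemp()
bare = os.path.join(demo_dir, "bare.jar")
standalone = os.path.join(demo_dir, "standalone.jar")
with zipfile.ZipFile(bare, "w") as jar:
    jar.writestr("org/python/Placeholder.class", b"")
with zipfile.ZipFile(standalone, "w") as jar:
    jar.writestr("org/python/Placeholder.class", b"")
    jar.writestr("Lib/os.py", "# stand-in for the real module\n")

print(jar_bundles_stdlib(bare))        # False
print(jar_bundles_stdlib(standalone))  # True
```

Pointing the same function at lib/jython.jar from the release candidate would show whether the bundled jar is missing the standard library.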

[jira] [Updated] (PIG-2736) Support implicit cast from bytearray to boolean

2012-06-08 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2736:


   Resolution: Fixed
Fix Version/s: 0.11
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

Patch committed to trunk. Thanks Jie!

> Support implicit cast from bytearray to boolean
> ---
>
> Key: PIG-2736
> URL: https://issues.apache.org/jira/browse/PIG-2736
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Jie Li
>Assignee: Jie Li
> Fix For: 0.11
>
> Attachments: PIG-2736.0.patch
>
>
>  
> {code}
> a = load 'allscalar10k' as (name, age, gpa, instate);
> b = filter a by instate==TRUE;
> dump b;   
> {code}
> Error:In alias b, incompatible types in Equal Operator left hand 
> side:bytearray right hand side:boolean
> Currently we only support implicit conversion from bytearray to chararray and 
> number types (int,long,float,double). We need to add support for boolean as 
> well.
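
For illustration, an implicit bytearray-to-boolean conversion could behave like the sketch below. The rule used here (case-insensitive "true"/"false", anything else becomes null) is an assumption and is not taken from the attached patch. The function name bytes_to_boolean and the sample rows are hypothetical.

```python
def bytes_to_boolean(raw):
    """Map b"true"/b"false" (any case) to a bool; anything else becomes None."""
    if raw is None:
        return None
    text = raw.decode("utf-8", errors="replace").lower()
    if text == "true":
        return True
    if text == "false":
        return False
    return None  # unparseable input becomes null instead of raising


# Rough equivalent of: b = filter a by instate == TRUE;
rows = [(b"alice", b"TRUE"), (b"bob", b"false"), (b"carol", b"maybe")]
in_state = [name for name, instate in rows if bytes_to_boolean(instate) is True]
print(in_state)  # [b'alice']
```

Returning null for unparseable values, rather than failing, matches the way a filter drops rows whose comparison does not evaluate to true.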





[jira] [Resolved] (PIG-2741) Python script throws an NameError: name 'Configuration' is not defined in case cache dir is not created

2012-06-08 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai resolved PIG-2741.
-

   Resolution: Fixed
Fix Version/s: 0.10.1
   0.11
 Assignee: Koji Noguchi
 Hadoop Flags: Reviewed

Patch committed to both 0.10 branch and trunk. Thanks Koji!

> Python script throws an NameError: name 'Configuration' is not defined in 
> case cache dir is not created
> ---
>
> Key: PIG-2741
> URL: https://issues.apache.org/jira/browse/PIG-2741
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.10.0
> Environment: Pig 0.10
>Reporter: Viraj Bhat
>Assignee: Koji Noguchi
> Fix For: 0.11, 0.10.1
>
> Attachments: pig-2741-no-test-yet-v1.patch.txt, 
> pig-2741-testfailing-pig2665-v2.patch.txt, 
> pig-2741-testfailing-pig2665-v3.patch.txt, 
> pig-2741-testfailing-pig2665-v4.patch.txt, 
> pig-2741-testfailing-pig2665-v5.patch.txt, 
> pig-2741-testfailing-pig2665-v5.patch.txt, 
> pig-2741-testfailing-pig2665-v6.patch.txt
>
>
> I have a Python script which writes out data to HDFS:
> {code}
> from org.apache.hadoop.conf import *
> from org.apache.hadoop.fs import *
> config = Configuration()
> hdfs = FileSystem.get(config)
> out = hdfs.create(Path("/user/viraj/junk.txt"))
> out.write("Hello World!")
> {code}
> When I run this I get the following error:
> {quote}
> 2012-06-06 01:20:43,101 [main] INFO  org.apache.pig.Main - Logging error 
> messages to: /home/viraj/pig_1338945643097.log
> 2012-06-06 01:20:43,502 [main] INFO  org.apache.pig.Main - Run embedded 
> script: jython
> 2012-06-06 01:20:43,603 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
> to hadoop file system at: hdfs://namenode:8020
> 2012-06-06 01:20:44,069 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
> to map-reduce job tracker at: jobtracker:50300
> *sys-package-mgr*: can't create package cache dir, '/mydir/xx'
> 2012-06-06 01:20:45,815 [main] INFO  
> org.apache.pig.scripting.jython.JythonScriptEngine - created tmp 
> python.cachedir=/tmp/pig_jython_7126458276821733512
> 2012-06-06 01:20:45,904 [main] ERROR org.apache.pig.Main - ERROR 1121: Python 
> Error. Traceback (most recent call last):
>   File "/homes/viraj/test.py", line 4, in 
> config = Configuration()
> NameError: name 'Configuration' is not defined
> {quote}
> I tried to solve it in various ways:
> 1) Overriding pig.properties to specify python.cachedir.skip=false does not 
> seem to work.
> 2) The only workaround is to specify -Dpython.cachedir=/mydirectory/tmp on 
> the command line.
> Viraj
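
The log above shows the engine failing to create the package cache directory and then falling back to a temporary python.cachedir. The fallback logic can be sketched as follows. This is not the code from the patch. The helper name choose_cachedir is hypothetical, and only the "pig_jython_" prefix is taken from the log output.

```python
import os
import tempfile


def choose_cachedir(preferred):
    """Return `preferred` if it is usable, otherwise a fresh temp directory."""
    try:
        os.makedirs(preferred, exist_ok=True)
        if os.access(preferred, os.W_OK):
            return preferred
    except OSError:
        pass  # cannot create the directory; fall through to the temp dir
    return tempfile.mkdtemp(prefix="pig_jython_")


writable = tempfile.mkdtemp()
print(choose_cachedir(writable) == writable)  # True: a usable directory is kept
```

The directory chosen this way would then be passed explicitly, as in the -Dpython.cachedir workaround described in the report.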
