[jira] [Updated] (PIG-4697) Serialize relevant part of the udfcontext per vertex to reduce payload size

2015-10-14 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-4697:

Description: 
  What HCatLoader/HCatStorer puts in UDFContext is huge, and if there are 
multiple of them in the pig script, the size of data sent to the Tez AM is 
huge, and so is the size of data that the Tez AM sends to tasks, causing RPC 
limit exceeded and OOM issues respectively.  If Pig serializes only the part 
of the udfcontext that is required for each vertex, it will save a lot.  HCat 
folks are also looking at cleaning up what goes into the conf (it ends up 
serializing the whole job conf, not just hive-site.xml) and moving the common 
part out to be shared by all hcat loaders and storers. 

Also looking at other options for faster and more compact serialization; will 
create separate jiras for that. Will use PIG-4653 to clean up all other pig 
config other than udfcontext.

  was:
  What HCatLoader/HCatStorer put in UDFContext is huge and if there are 
multiple of them in the pig script, the size of data sent to Tez AM is huge and 
the size of data that Tez AM to tasks is huge and causing either RPC limit 
exceeded or OOM issues.  If Pig serializes only part of the udfcontext that is 
required for each vertex, it will save a lot.  HCat folks are also looking up 
at cleaning what goes into the conf (it ends up serializing whole job conf, not 
just hive-site.xml) and moving out the common part to be shared by all hcat 
loaders and stores. 

Also looking at other options for faster and compact serialization. Will create 
separate jiras for that. Will use PIG-4653 to cleanup all other pig config 
other than udfcontext.

Summary: Serialize relevant part of the udfcontext per vertex to reduce 
payload size  (was: Pig needs to serialize only part of the udfcontext for each 
vertex)

> Serialize relevant part of the udfcontext per vertex to reduce payload size
> ---
>
> Key: PIG-4697
> URL: https://issues.apache.org/jira/browse/PIG-4697
> Project: Pig
>  Issue Type: Improvement
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.16.0
>
> Attachments: PIG-4697-1.patch
>
>
>   What HCatLoader/HCatStorer puts in UDFContext is huge, and if there are 
> multiple of them in the pig script, the size of data sent to the Tez AM is 
> huge, and so is the size of data that the Tez AM sends to tasks, causing RPC 
> limit exceeded and OOM issues respectively.  If Pig serializes only the part 
> of the udfcontext that is required for each vertex, it will save a lot.  
> HCat folks are also looking at cleaning up what goes into the conf (it ends 
> up serializing the whole job conf, not just hive-site.xml) and moving the 
> common part out to be shared by all hcat loaders and storers. 
> Also looking at other options for faster and more compact serialization; 
> will create separate jiras for that. Will use PIG-4653 to clean up all other 
> pig config other than udfcontext.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4697) Pig needs to serialize only part of the udfcontext for each vertex

2015-10-14 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-4697:

Status: Patch Available  (was: Open)

> Pig needs to serialize only part of the udfcontext for each vertex
> --
>
> Key: PIG-4697
> URL: https://issues.apache.org/jira/browse/PIG-4697
> Project: Pig
>  Issue Type: Improvement
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.16.0
>
> Attachments: PIG-4697-1.patch
>
>
>   What HCatLoader/HCatStorer put in UDFContext is huge, and if there are 
> multiple of them in the pig script, the size of data sent to the Tez AM is 
> huge, and the size of data that the Tez AM sends to tasks is huge, causing 
> either RPC limit exceeded or OOM issues.  If Pig serializes only the part of 
> the udfcontext that is required for each vertex, it will save a lot.  HCat 
> folks are also looking at cleaning up what goes into the conf (it ends up 
> serializing the whole job conf, not just hive-site.xml) and moving the 
> common part out to be shared by all hcat loaders and storers. 
> Also looking at other options for faster and more compact serialization; 
> will create separate jiras for that. Will use PIG-4653 to clean up all other 
> pig config other than udfcontext.





[jira] [Updated] (PIG-4697) Pig needs to serialize only part of the udfcontext for each vertex

2015-10-14 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-4697:

Attachment: PIG-4697-1.patch

Changes done:
   - Only serialize LoadFuncs in the MRInput payload
   - Only serialize StoreFuncs in the MROutput payload
   - Serialize all LoadFuncs, StoreFuncs and EvalFuncs in the Processor 
payload
   - Serialize only EvalFuncs in the edge payload for the Combiner
   - If we cannot match a UDFContextKey to the plan of any of the vertices, 
add it everywhere.
   - Keep a local copy of the serialized PigContext, udf.import.list and tez 
plan instead of serializing them again and again.
   - Make a copy of payLoadConf earlier for the MRInput/MROutput payload and 
avoid a lot of unnecessary stuff being copied over.
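The matching logic behind the bullets above can be sketched as follows. This is a minimal illustration with invented names (`build_vertex_payloads`, plain dicts standing in for UDFContext and the vertex plans), not Pig's actual Tez DAG-building code:

```python
def build_vertex_payloads(udf_context, vertex_plans):
    """Split a UDFContext into per-vertex payloads.

    udf_context:  dict mapping a UDF signature (UDFContextKey) to its
                  serialized properties blob.
    vertex_plans: dict mapping a vertex name to the set of UDF signatures
                  appearing in that vertex's plan.
    """
    payloads = {vertex: {} for vertex in vertex_plans}
    for signature, blob in udf_context.items():
        targets = [v for v, sigs in vertex_plans.items() if signature in sigs]
        # Fallback from the bullet above: a key we cannot match to any
        # vertex plan is added everywhere, so no UDF silently loses its
        # context.
        for vertex in targets or vertex_plans:
            payloads[vertex][signature] = blob
    return payloads
```

Each vertex then ships only the entries its own plan needs, instead of every HCatLoader/HCatStorer blob going to every vertex.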


> Pig needs to serialize only part of the udfcontext for each vertex
> --
>
> Key: PIG-4697
> URL: https://issues.apache.org/jira/browse/PIG-4697
> Project: Pig
>  Issue Type: Improvement
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.16.0
>
> Attachments: PIG-4697-1.patch
>
>
>   What HCatLoader/HCatStorer put in UDFContext is huge, and if there are 
> multiple of them in the pig script, the size of data sent to the Tez AM is 
> huge, and the size of data that the Tez AM sends to tasks is huge, causing 
> either RPC limit exceeded or OOM issues.  If Pig serializes only the part of 
> the udfcontext that is required for each vertex, it will save a lot.  HCat 
> folks are also looking at cleaning up what goes into the conf (it ends up 
> serializing the whole job conf, not just hive-site.xml) and moving the 
> common part out to be shared by all hcat loaders and storers. 
> Also looking at other options for faster and more compact serialization; 
> will create separate jiras for that. Will use PIG-4653 to clean up all other 
> pig config other than udfcontext.





Jenkins build became unstable: Pig-trunk-commit #2248

2015-10-14 Thread Apache Jenkins Server
See 



[jira] Subscription: PIG patch available

2015-10-14 Thread jira
Issue Subscription
Filter: PIG patch available (28 issues)

Subscriber: pigdaily

Key Summary
PIG-4702  Load once for sampling and partitioning in order by for certain 
LoadFuncs
https://issues.apache.org/jira/browse/PIG-4702
PIG-4684  Exception should be changed to warning when job diagnostics cannot 
be fetched
https://issues.apache.org/jira/browse/PIG-4684
PIG-4677  Display failure information on stop on failure
https://issues.apache.org/jira/browse/PIG-4677
PIG-4656  Improve String serialization and comparator performance in 
BinInterSedes
https://issues.apache.org/jira/browse/PIG-4656
PIG-4641  Print the instance of Object without using toString()
https://issues.apache.org/jira/browse/PIG-4641
PIG-4598  Allow user defined plan optimizer rules
https://issues.apache.org/jira/browse/PIG-4598
PIG-4581  thread safe issue in NodeIdGenerator
https://issues.apache.org/jira/browse/PIG-4581
PIG-4539  New PigUnit
https://issues.apache.org/jira/browse/PIG-4539
PIG-4515  org.apache.pig.builtin.Distinct throws ClassCastException
https://issues.apache.org/jira/browse/PIG-4515
PIG-4468  Pig's jackson version conflicts with that of hadoop 2.6.0
https://issues.apache.org/jira/browse/PIG-4468
PIG-4455  Should use DependencyOrderWalker instead of DepthFirstWalker in 
MRPrinter
https://issues.apache.org/jira/browse/PIG-4455
PIG-4417  Pig's register command should support automatic fetching of jars 
from repo.
https://issues.apache.org/jira/browse/PIG-4417
PIG-4373  Implement PIG-3861 in Tez
https://issues.apache.org/jira/browse/PIG-4373
PIG-4341  Add CMX support to pig.tmpfilecompression.codec
https://issues.apache.org/jira/browse/PIG-4341
PIG-4323  PackageConverter hanging in Spark
https://issues.apache.org/jira/browse/PIG-4323
PIG-4313  StackOverflowError in LIMIT operation on Spark
https://issues.apache.org/jira/browse/PIG-4313
PIG-4251  Pig on Storm
https://issues.apache.org/jira/browse/PIG-4251
PIG-4111  Make Pig compiles with avro-1.7.7
https://issues.apache.org/jira/browse/PIG-4111
PIG-4002  Disable combiner when map-side aggregation is used
https://issues.apache.org/jira/browse/PIG-4002
PIG-3952  PigStorage accepts '-tagSplit' to return full split information
https://issues.apache.org/jira/browse/PIG-3952
PIG-3911  Define unique fields with @OutputSchema
https://issues.apache.org/jira/browse/PIG-3911
PIG-3877  Getting Geo Latitude/Longitude from Address Lines
https://issues.apache.org/jira/browse/PIG-3877
PIG-3873  Geo distance calculation using Haversine
https://issues.apache.org/jira/browse/PIG-3873
PIG-3866  Create ThreadLocal classloader per PigContext
https://issues.apache.org/jira/browse/PIG-3866
PIG-3864  ToDate(userstring, format, timezone) computes DateTime with strange 
handling of Daylight Saving Time with location based timezones
https://issues.apache.org/jira/browse/PIG-3864
PIG-3851  Upgrade jline to 2.11
https://issues.apache.org/jira/browse/PIG-3851
PIG-3668  COR built-in function when atleast one of the coefficient values is 
NaN
https://issues.apache.org/jira/browse/PIG-3668
PIG-3587  add functionality for rolling over dates
https://issues.apache.org/jira/browse/PIG-3587

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328&filterId=12322384


Re: Threshold for errors in STORE

2015-10-14 Thread Prashant Kommireddi
The proposed approach sounds good. If there are no objections, please go
ahead and file a JIRA. I can take a look once you have a patch available.

On Wed, Oct 14, 2015 at 2:20 PM, Siddhi Mehta  wrote:

> Sending to the pig developers group
>
> On Wed, Oct 14, 2015 at 2:17 PM, Siddhi Mehta 
> wrote:
>
> > Hello Everyone,
> >
> > Just wanted to follow up on my earlier post and see if there are any
> > thoughts around the same.
> > I was planning to take a stab at implementing it.
> >
> > The approach I was planning to use is:
> > 1. Make the storer that wants error-handling capability implement an
> > interface (ErrorHandlingStoreFunc).
> > 2. Using this interface the storer can define the threshold for errors.
> > Each store func can determine what the threshold should be. For example,
> > HBaseStorage can have a different threshold from ParquetStorage.
> > 3. Whenever the storer gets created in
> >
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getStoreFunc()
> > we intercept the call and give it a wrapped StoreFunc.
> > 4. Every putNext() call now gets delegated to the actual storer via the
> > delegate, and we can listen for errors on putNext(), allowing the error
> > if within the threshold or rethrowing it from there.
> > 5. The client can get information from the counters to know if there was
> > any data dropped.
> >
> > Thoughts?
> >
> > Thanks,
> > Siddhi
> >
> >
> > On Mon, Oct 12, 2015 at 1:49 PM, Siddhi Mehta 
> > wrote:
> >
> >> Hey Guys,
> >>
> >> Currently a Pig job fails when one record out of billions of records
> >> fails on STORE.
> >> This is not always desirable behavior when you are dealing with millions
> >> of records and only a few fail.
> >> In certain use-cases it's desirable to know how many such errors occurred
> >> and to have an accounting of the same.
> >> Is there a configurable limit that we can set for Pig so that we can
> >> allow a threshold for bad records on STORE, along the lines of the
> >> JIRA for LOAD, PIG-3059?
> >>
> >> Thanks,
> >> Siddhi
> >>
> >
> >
>
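The wrapped-delegate idea in steps 3 to 5 of the quoted proposal might look roughly like this. All names here are invented for illustration (ErrorThresholdStoreFunc, StoreError); this is not Pig's actual StoreFunc interface:

```python
class StoreError(Exception):
    """Stand-in for a failure raised by the real storer's putNext()."""

class ErrorThresholdStoreFunc:
    """Wraps the real storer; tolerates putNext() failures up to a
    storer-defined threshold and counts them (per step 5, the count
    would be published as a job counter for the client to inspect)."""

    def __init__(self, delegate, threshold):
        self.delegate = delegate
        self.threshold = threshold  # max tolerated bad records
        self.errors = 0             # would back a Hadoop counter

    def put_next(self, record):
        try:
            self.delegate.put_next(record)
        except StoreError:
            self.errors += 1
            if self.errors > self.threshold:
                raise  # over the threshold: fail the job as today
```

Each store func (HBase, Parquet, ...) would supply its own threshold, and callers interact only with the wrapper.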


[jira] [Updated] (PIG-4702) Load once for sampling and partitioning in order by for certain LoadFuncs

2015-10-14 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-4702:

Attachment: PIG-4702-1.patch

Also fixed Native_2 e2e test which was failing after PIG-4574.

> Load once for sampling and partitioning in order by for certain LoadFuncs
> -
>
> Key: PIG-4702
> URL: https://issues.apache.org/jira/browse/PIG-4702
> Project: Pig
>  Issue Type: Improvement
>  Components: tez
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.16.0
>
> Attachments: PIG-4702-1.patch
>
>
>For HBase and Accumulo, it will be more efficient on IO to have the data 
> written to disk instead of reading from them again. 





[jira] [Updated] (PIG-4702) Load once for sampling and partitioning in order by for certain LoadFuncs

2015-10-14 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-4702:

Component/s: tez

> Load once for sampling and partitioning in order by for certain LoadFuncs
> -
>
> Key: PIG-4702
> URL: https://issues.apache.org/jira/browse/PIG-4702
> Project: Pig
>  Issue Type: Improvement
>  Components: tez
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.16.0
>
> Attachments: PIG-4702-1.patch
>
>
>For HBase and Accumulo, it will be more efficient on IO to have the data 
> written to disk instead of reading from them again. 





[jira] [Updated] (PIG-4702) Load once for sampling and partitioning in order by for certain LoadFuncs

2015-10-14 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-4702:

Status: Patch Available  (was: Open)

> Load once for sampling and partitioning in order by for certain LoadFuncs
> -
>
> Key: PIG-4702
> URL: https://issues.apache.org/jira/browse/PIG-4702
> Project: Pig
>  Issue Type: Improvement
>  Components: tez
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.16.0
>
> Attachments: PIG-4702-1.patch
>
>
>For HBase and Accumulo, it will be more efficient on IO to have the data 
> written to disk instead of reading from them again. 





[jira] [Created] (PIG-4702) Load once for sampling and partitioning in order by for certain LoadFuncs

2015-10-14 Thread Rohini Palaniswamy (JIRA)
Rohini Palaniswamy created PIG-4702:
---

 Summary: Load once for sampling and partitioning in order by for 
certain LoadFuncs
 Key: PIG-4702
 URL: https://issues.apache.org/jira/browse/PIG-4702
 Project: Pig
  Issue Type: Improvement
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
 Fix For: 0.16.0


   For HBase and Accumulo, it will be more efficient on IO to have the data 
written to disk instead of reading from them again. 





Re: Threshold for errors in STORE

2015-10-14 Thread Siddhi Mehta
Sending to the pig developers group

On Wed, Oct 14, 2015 at 2:17 PM, Siddhi Mehta  wrote:

> Hello Everyone,
>
> Just wanted to follow up on my earlier post and see if there are any
> thoughts around the same.
> I was planning to take a stab at implementing it.
>
> The approach I was planning to use is:
> 1. Make the storer that wants error-handling capability implement an
> interface (ErrorHandlingStoreFunc).
> 2. Using this interface the storer can define the threshold for errors.
> Each store func can determine what the threshold should be. For example,
> HBaseStorage can have a different threshold from ParquetStorage.
> 3. Whenever the storer gets created in
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getStoreFunc()
> we intercept the call and give it a wrapped StoreFunc.
> 4. Every putNext() call now gets delegated to the actual storer via the
> delegate, and we can listen for errors on putNext(), allowing the error
> if within the threshold or rethrowing it from there.
> 5. The client can get information from the counters to know if there was
> any data dropped.
>
> Thoughts?
>
> Thanks,
> Siddhi
>
>
> On Mon, Oct 12, 2015 at 1:49 PM, Siddhi Mehta 
> wrote:
>
>> Hey Guys,
>>
>> Currently a Pig job fails when one record out of billions of records
>> fails on STORE.
>> This is not always desirable behavior when you are dealing with millions
>> of records and only a few fail.
>> In certain use-cases it's desirable to know how many such errors occurred
>> and to have an accounting of the same.
>> Is there a configurable limit that we can set for Pig so that we can
>> allow a threshold for bad records on STORE, along the lines of the
>> JIRA for LOAD, PIG-3059?
>>
>> Thanks,
>> Siddhi
>>
>
>


[jira] [Updated] (PIG-4693) Class conflicts: Kryo bundled in spark vs kryo bundled with pig

2015-10-14 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-4693:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Srikanth!

> Class conflicts: Kryo bundled in spark vs kryo bundled with pig
> ---
>
> Key: PIG-4693
> URL: https://issues.apache.org/jira/browse/PIG-4693
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Affects Versions: spark-branch
>Reporter: Srikanth Sundarrajan
>Assignee: Srikanth Sundarrajan
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4693.patch
>
>






[jira] [Created] (PIG-4701) Set alias and feature on all vertices

2015-10-14 Thread Rohini Palaniswamy (JIRA)
Rohini Palaniswamy created PIG-4701:
---

 Summary: Set alias and feature on all vertices
 Key: PIG-4701
 URL: https://issues.apache.org/jira/browse/PIG-4701
 Project: Pig
  Issue Type: Bug
Reporter: Rohini Palaniswamy


  While working on PIG-4699, saw that alias, alias location and feature are 
not set on some vertices. Need to track those down (the e2e test log is an 
easy place to find them all) and have them set properly so that debugging is 
easier.





[jira] [Created] (PIG-4700) Pig should call ProcessorContext.setProgress() in TezTaskContext

2015-10-14 Thread Rohini Palaniswamy (JIRA)
Rohini Palaniswamy created PIG-4700:
---

 Summary: Pig should call ProcessorContext.setProgress() in 
TezTaskContext
 Key: PIG-4700
 URL: https://issues.apache.org/jira/browse/PIG-4700
 Project: Pig
  Issue Type: Bug
Reporter: Rohini Palaniswamy
 Fix For: 0.16.0








[jira] [Updated] (PIG-4699) Print Job stats information in Tez like mapreduce

2015-10-14 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-4699:

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to trunk. Thanks for the review Daniel.

> Print Job stats information in Tez like mapreduce
> -
>
> Key: PIG-4699
> URL: https://issues.apache.org/jira/browse/PIG-4699
> Project: Pig
>  Issue Type: Improvement
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.16.0
>
> Attachments: PIG-4699-1.patch, sample-output.txt
>
>
>    Job stats information in mapreduce is extremely useful while debugging 
> or identifying which of the mapreduce jobs is taking time. Without it, it 
> is hard to figure out the same in Tez, or which aliases are being processed 
> in its vertices. 





Re: Dependency version on Kryo

2015-10-14 Thread Rohini Palaniswamy
That should be fine. We wanted to get rid of the kryo dependency in ORC and
use the shaded one that hive uses, but that is in the hive-exec jar, which is
huge and has too many other jars packed in, so we did not want to add it as a
dependency to Pig.

On Mon, Oct 12, 2015 at 8:54 PM, Xuefu Zhang  wrote:

> Hi all,
>
> It was found in PIG-4693 (https://issues.apache.org/jira/browse/PIG-4693)
> that Pig currently depends on Kryo 2.22. However, Spark depends on 2.21.
> The two versions are not completely compatible. We tried several ways to
> solve the problem, but unfortunately none worked. This is mainly because
> Spark doesn't give users an opportunity to provide their own kryo library
> (SPARK-10910). Please refer to the full discussion in PIG-4693.
>
> It seems that Pig brought in the kryo dependency for ORC. I'm wondering if
> there are any specific reasons for kryo 2.22 and, if not, whether we can
> downgrade the dependency to 2.21 instead. Our initial test shows that kryo
> 2.21 works just fine for ORC. This would obviously solve our problem as well.
>
> Your input to this is greatly appreciated.
>
> Thanks,
> Xuefu
>


[jira] [Commented] (PIG-4689) CSV Writes incorrect header if two CSV files are created in one script

2015-10-14 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957300#comment-14957300
 ] 

Rohini Palaniswamy commented on PIG-4689:
-

bq. Possibly adding the new method is better (setting the same value).
   Yes. You need both. setUDFContextSignature is for LoadFunc and 
setStoreFuncUDFContextSignature is for StoreFunc. 
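The reason both methods matter is that the storer must key whatever it stashes in UDFContext by its own signature; with a shared key, the second STORE clobbers the first, which is exactly the reported bug. A toy sketch of the signature-keyed pattern, with invented names standing in for the real UDFContext and CSVExcelStorage code:

```python
class UDFContext:
    """Toy stand-in for Pig's UDFContext: one property bag per signature."""
    def __init__(self):
        self._props = {}

    def get_properties(self, signature):
        return self._props.setdefault(signature, {})

class CsvStorer:
    def __init__(self, context):
        self.context = context
        self.signature = None

    def set_store_func_udf_context_signature(self, signature):
        # Pig hands each STORE statement a unique signature via
        # setStoreFuncUDFContextSignature().
        self.signature = signature

    def check_schema(self, header):
        # Keyed by signature, the two storers' headers stay separate;
        # a shared key here reproduces the clobbered-header bug.
        self.context.get_properties(self.signature)["header"] = header

    def header(self):
        return self.context.get_properties(self.signature)["header"]
```

With two storers sharing one context but distinct signatures, each writes and reads back its own header.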

> CSV Writes incorrect header if two CSV files are created in one script
> --
>
> Key: PIG-4689
> URL: https://issues.apache.org/jira/browse/PIG-4689
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.14.0, 0.15.0
>Reporter: Niels Basjes
>Assignee: Niels Basjes
> Attachments: PIG-4689-2015-10-06.patch
>
>
> From a single Pig script I write two completely different and unrelated CSV 
> files; both with the flag 'WRITE_OUTPUT_HEADER'.
> The bug is that both files get the SAME header at the top of the output file 
> even though the data is different.
> *Reproduction:*
> {code:title=foo.txt}
> 1
> {code}
> {code:title=bar.txt (Tab separated)}
> 1 a
> {code}
> {code:title=WriteTwoCSV.pig}
> FOO =
> LOAD 'foo.txt'
> USING PigStorage('\t')
> AS (a:chararray);
> BAR =
> LOAD 'bar.txt'
> USING PigStorage('\t')
> AS (b:chararray, c:chararray);
> STORE FOO into 'Foo'
> USING org.apache.pig.piggybank.storage.CSVExcelStorage('\t','NO_MULTILINE', 
> 'UNIX', 'WRITE_OUTPUT_HEADER');
> STORE BAR into 'Bar'
> USING org.apache.pig.piggybank.storage.CSVExcelStorage('\t','NO_MULTILINE', 
> 'UNIX', 'WRITE_OUTPUT_HEADER');
> {code}
> *Command:*
> {quote}pig -x local WriteTwoCSV.pig{quote}
> *Result:*
> {quote}cat Bar/part-*{quote}
> {code}
> b c
> 1 a
> {code}
> {quote}cat Foo/part-*{quote}
> {code}
> b c
> 1
> {code}
> *The error is that the {{Foo}} output has the two-column header from the 
> {{Bar}} output.*
> *One of the effects is that parsing the {{Foo}} data will probably fail due 
> to the varying number of columns*





[jira] [Updated] (PIG-4689) CSV Writes incorrect header if two CSV files are created in one script

2015-10-14 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes updated PIG-4689:
--
Status: Open  (was: Patch Available)

Needs rework

> CSV Writes incorrect header if two CSV files are created in one script
> --
>
> Key: PIG-4689
> URL: https://issues.apache.org/jira/browse/PIG-4689
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.15.0, 0.14.0
>Reporter: Niels Basjes
>Assignee: Niels Basjes
> Attachments: PIG-4689-2015-10-06.patch
>
>
> From a single Pig script I write two completely different and unrelated CSV 
> files; both with the flag 'WRITE_OUTPUT_HEADER'.
> The bug is that both files get the SAME header at the top of the output file 
> even though the data is different.
> *Reproduction:*
> {code:title=foo.txt}
> 1
> {code}
> {code:title=bar.txt (Tab separated)}
> 1 a
> {code}
> {code:title=WriteTwoCSV.pig}
> FOO =
> LOAD 'foo.txt'
> USING PigStorage('\t')
> AS (a:chararray);
> BAR =
> LOAD 'bar.txt'
> USING PigStorage('\t')
> AS (b:chararray, c:chararray);
> STORE FOO into 'Foo'
> USING org.apache.pig.piggybank.storage.CSVExcelStorage('\t','NO_MULTILINE', 
> 'UNIX', 'WRITE_OUTPUT_HEADER');
> STORE BAR into 'Bar'
> USING org.apache.pig.piggybank.storage.CSVExcelStorage('\t','NO_MULTILINE', 
> 'UNIX', 'WRITE_OUTPUT_HEADER');
> {code}
> *Command:*
> {quote}pig -x local WriteTwoCSV.pig{quote}
> *Result:*
> {quote}cat Bar/part-*{quote}
> {code}
> b c
> 1 a
> {code}
> {quote}cat Foo/part-*{quote}
> {code}
> b c
> 1
> {code}
> *The error is that the {{Foo}} output has the two-column header from the 
> {{Bar}} output.*
> *One of the effects is that parsing the {{Foo}} data will probably fail due 
> to the varying number of columns*


