[ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1229:
----------------------------------

    Attachment: pig-1229.patch

Ankur,

Sorry for the late reply. I experimented with your latest patch and was able 
to make some progress on it. I was able to get rid of the Path problems (it 
looks like Pig itself handles this incorrectly in one place). I think the 
patch I attached should work, but I cannot get the test case to pass because 
of an hsqldb problem that I have not been able to resolve. I keep getting this 
error from it:
{noformat}
Caused by: java.sql.SQLException: The database is already in use by another process: org.hsqldb.persist.niolockf...@4abea04e[file =/private/tmp/batchtest.lck, exists=true, locked=false, valid=false, fl =null]: java.lang.Exception: checkHeartbeat(): lock file [/private/tmp/batchtest.lck] is presumably locked by another process.
        at org.hsqldb.jdbc.Util.sqlException(Unknown Source)
        at org.hsqldb.jdbc.jdbcConnection.<init>(Unknown Source)
        at org.hsqldb.jdbcDriver.getConnection(Unknown Source)
        at org.hsqldb.jdbcDriver.connect(Unknown Source)
        at java.sql.DriverManager.getConnection(DriverManager.java:582)
        at java.sql.DriverManager.getConnection(DriverManager.java:185)
        at org.apache.pig.piggybank.storage.DBStorage.prepareToWrite(DBStorage.java:274)

{noformat}
Anyway, here are the changes I made:
1.
{code}
Index: src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
===================================================================
-                conf.set("pig.streaming.log.dir", 
-                            new Path(outputPath, LOG_DIR).toString());
+//                conf.set("pig.streaming.log.dir", 
+//                            new Path(outputPath, LOG_DIR).toString());
                 conf.set("pig.streaming.task.output.dir", outputPath);
             }
{code}
This looks like a problem in Pig. Pig incorrectly assumes that it can write 
the logs generated by a stream command to the output location, which is wrong 
when the output location is something like a DB. Since this requires changes 
in the main Pig code, I suggest opening a new JIRA for it and tracking it there.

2. Then in DBStorage.java
{code}
@Override
public void setStoreLocation(String location, Job job) throws IOException {
  job.getConfiguration().set("pig.db.conn.string", location);
}
@Override
public RecordWriter<NullWritable, NullWritable> getRecordWriter(
    TaskAttemptContext context) throws IOException, InterruptedException {
  jdbcURL = context.getConfiguration().get("pig.db.conn.string");
  return null;
}
{code} 
The DB connection string needs to be saved in the job configuration in 
setStoreLocation() and then retrieved on the backend in getRecordWriter(). 
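
As a rough illustration of that hand-off, the sketch below uses a plain 
java.util.Properties object as a stand-in for the Hadoop job Configuration. 
That substitution is an assumption made only so the example runs without 
Hadoop on the classpath. The class name ConnStringHandoff, the method 
readConnString() and the example JDBC URL are hypothetical; only the key 
"pig.db.conn.string" comes from the patch.

```java
import java.util.Properties;

public class ConnStringHandoff {
    // Key used in the patch to carry the JDBC URL from front end to back end.
    public static final String CONN_KEY = "pig.db.conn.string";

    // Front end: mirrors setStoreLocation(), which records the store location.
    // Properties is an assumed stand-in for job.getConfiguration().
    public static void setStoreLocation(String location, Properties jobConf) {
        jobConf.setProperty(CONN_KEY, location);
    }

    // Back end: mirrors the lookup done in getRecordWriter().
    // Fails loudly if the front end never stored the value.
    public static String readConnString(Properties taskConf) {
        String url = taskConf.getProperty(CONN_KEY);
        if (url == null) {
            throw new IllegalStateException(CONN_KEY + " was not set on the front end");
        }
        return url;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Hypothetical JDBC URL, used only for illustration.
        setStoreLocation("jdbc:hsqldb:file:/tmp/example", conf);
        System.out.println(readConnString(conf));
    }
}
```

Failing fast on a missing key is a design choice of this sketch, not something 
the patch does; it makes a forgotten setStoreLocation() call easier to spot 
than a null jdbcURL surfacing later.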

3. In DBStorage.java
{code}
@Override
public void cleanupOnFailure(String location, Job job) throws IOException {
  log.error("Job has failed.");
}
{code}
You must override this StoreFunc method because the default implementation 
assumes the output location is a FileSystem. For now I have left it as a 
no-op, but it can be improved to do rollbacks, release DB connections, etc. 
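
One possible shape for such a rollback step is sketched below using only the 
standard java.sql API. The class DbCleanup and the method rollbackAndClose() 
are hypothetical names, not part of the patch, and how a Connection would be 
made available to cleanupOnFailure() is left open here.

```java
import java.sql.Connection;
import java.sql.SQLException;

public class DbCleanup {
    // Rolls back uncommitted work (if auto-commit is off) and always tries
    // to close the connection. Returns true only if a rollback was performed.
    public static boolean rollbackAndClose(Connection conn) {
        if (conn == null) {
            return false; // nothing to clean up
        }
        boolean rolledBack = false;
        try {
            // rollback() is only meaningful when auto-commit is disabled.
            if (!conn.getAutoCommit()) {
                conn.rollback();
                rolledBack = true;
            }
        } catch (SQLException e) {
            System.err.println("Rollback failed: " + e.getMessage());
        } finally {
            try {
                conn.close();
            } catch (SQLException e) {
                System.err.println("Close failed: " + e.getMessage());
            }
        }
        return rolledBack;
    }
}
```

Errors are logged rather than rethrown in this sketch so that a failure during 
cleanup does not mask the original job failure.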


> allow pig to write output into a JDBC db
> ----------------------------------------
>
>                 Key: PIG-1229
>                 URL: https://issues.apache.org/jira/browse/PIG-1229
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>            Reporter: Ian Holsman
>            Assignee: Ankur
>            Priority: Minor
>             Fix For: 0.8.0
>
>         Attachments: jira-1229-v2.patch, jira-1229-v3.patch, pig-1229.patch
>
>
> UDF to store data into a DB

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
