Jeremy, do you know the best approach here?

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 11/05/2012, at 1:13 AM, Shawna Qian wrote:

> Hi 
> 
> Can I use SSTableSimpleUnsortedWriter to write the data directly to HDFS, or 
> do I have to use hdfs copyFromLocal to copy the sstable files from local disk 
> to HDFS after they are generated?
> 
> Thx
> Shawna
> 
> Sent from my iPhone
> 
> On May 7, 2012, at 3:48 AM, "aaron morton" <aa...@thelastpickle.com> wrote:
> 
>> Can you copy the sstables as a task after the load operation? You should 
>> know where the files are. 
>> 
>> Multiple files may be created by the writer during the loading process, so 
>> running code that performs a long-running action will increase the time 
>> taken to pump data through the SSTableSimpleUnsortedWriter.
>> 
>> wrt the patch, the best place to start the conversation for this is on 
>> https://issues.apache.org/jira/browse/CASSANDRA 
>> 
>> Thanks for taking the time to look into this. 
>> 
>> Cheers
>> 
>> 
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 3/05/2012, at 11:40 PM, Benoit Perroud wrote:
>> 
>>> Hi All,
>>> 
>>> I'm bulk loading (a lot of) data from Hadoop into Cassandra 1.0.x. The
>>> provided CFOutputFormat is not the best fit here, so I wanted to use the
>>> bulk loading feature. I know 1.1 comes with a BulkOutputFormat, but I
>>> wanted to propose a simple enhancement to SSTableSimpleUnsortedWriter
>>> that could make life easier:
>>> 
>>> When an SSTable is flushed to disk, it would be useful to have
>>> listeners that are triggered to perform some action (copying my
>>> SSTable into HDFS, for instance).
>>> 
>>> Please have a look at the patch below to get a better idea. Do you
>>> think it would be worthwhile opening a JIRA for this?
>>> 
>>> 
>>> Regarding the 1.1 BulkOutputFormat and bulk loading in general, the work
>>> done to have a light client stream into the cluster is really great. The
>>> issue now is that data is streamed only at the end of the task. This
>>> causes all the tasks to store the data locally and stream everything
>>> at the end. Lots of temporary space may be needed, and a lot of
>>> bandwidth to the nodes is used at the "same" time. With the listener,
>>> we would be able to start streaming as soon as the first SSTable is
>>> created. That way the streaming bandwidth could be better balanced.
>>> A JIRA for this also?
>>> 
>>> Thanks
>>> 
>>> Benoit.
>>> 
>>> 
>>> 
>>> 
>>> --- a/src/java/org/apache/cassandra/io/sstable/SSTableSimpleUnsortedWriter.java
>>> +++ b/src/java/org/apache/cassandra/io/sstable/SSTableSimpleUnsortedWriter.java
>>> @@ -21,6 +21,8 @@ package org.apache.cassandra.io.sstable;
>>> import java.io.File;
>>> import java.io.IOException;
>>> import java.nio.ByteBuffer;
>>> +import java.util.LinkedList;
>>> +import java.util.List;
>>> import java.util.Map;
>>> import java.util.TreeMap;
>>> 
>>> @@ -47,6 +49,8 @@ public class SSTableSimpleUnsortedWriter extends AbstractSSTableSimpleWriter
>>>     private final long bufferSize;
>>>     private long currentSize;
>>> 
>>> +    private final List<SSTableWriterListener> sSTableWrittenListeners = new LinkedList<SSTableWriterListener>();
>>> +
>>>     /**
>>>      * Create a new buffering writer.
>>>      * @param directory the directory where to write the sstables
>>> @@ -123,5 +127,16 @@ public class SSTableSimpleUnsortedWriter extends AbstractSSTableSimpleWriter
>>>         }
>>>         currentSize = 0;
>>>         keys.clear();
>>> +
>>> +        // Notify the registered listeners
>>> +        for (SSTableWriterListener listeners : sSTableWrittenListeners)
>>> +        {
>>> +            listeners.onSSTableWrittenAndClosed(writer.getTableName(), writer.getColumnFamilyName(), writer.getFilename());
>>> +        }
>>> +    }
>>> +
>>> +    public void addSSTableWriterListener(SSTableWriterListener listener)
>>> +    {
>>> +       sSTableWrittenListeners.add(listener);
>>>     }
>>> }
>>> diff --git a/src/java/org/apache/cassandra/io/sstable/SSTableWriterListener.java b/src/java/org/apache/cassandra/io/sstable/SSTableWriterListener.java
>>> new file mode 100644
>>> index 0000000..6628d20
>>> --- /dev/null
>>> +++ b/src/java/org/apache/cassandra/io/sstable/SSTableWriterListener.java
>>> @@ -0,0 +1,9 @@
>>> +package org.apache.cassandra.io.sstable;
>>> +
>>> +import java.io.IOException;
>>> +
>>> +public interface SSTableWriterListener {
>>> +
>>> +       void onSSTableWrittenAndClosed(final String tableName, final String columnFamilyName, final String filename) throws IOException;
>>> +
>>> +}
>> 
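
For readers following the thread, here is a minimal, self-contained sketch of the listener idea proposed in the patch above. It is not Cassandra code. BufferingWriter is a hypothetical stand-in for SSTableSimpleUnsortedWriter that writes plain text files, and CopyingListener copies each finished file to a second local directory where a real job would presumably upload to HDFS. Only the listener interface mirrors the patch; every other name, including the "-Data.db" file naming, is an assumption made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

public class FlushListenerSketch {

    /** Same shape as the SSTableWriterListener interface in the patch. */
    interface SSTableWriterListener {
        void onSSTableWrittenAndClosed(String tableName, String columnFamilyName, String filename)
                throws IOException;
    }

    /** Hypothetical stand-in for the real writer: buffers rows and writes one file per flush. */
    static class BufferingWriter {
        private final Path directory;
        private final String tableName;
        private final String columnFamilyName;
        private final int maxBufferedRows;
        private final List<String> buffer = new ArrayList<>();
        private final List<SSTableWriterListener> listeners = new ArrayList<>();
        private int generation = 0;

        BufferingWriter(Path directory, String tableName, String columnFamilyName, int maxBufferedRows) {
            this.directory = directory;
            this.tableName = tableName;
            this.columnFamilyName = columnFamilyName;
            this.maxBufferedRows = maxBufferedRows;
        }

        void addSSTableWriterListener(SSTableWriterListener listener) {
            listeners.add(listener);
        }

        void addRow(String row) throws IOException {
            buffer.add(row);
            if (buffer.size() >= maxBufferedRows) {
                sync();
            }
        }

        /** Flushes whatever is still buffered. */
        void close() throws IOException {
            sync();
        }

        private void sync() throws IOException {
            if (buffer.isEmpty()) {
                return;
            }
            generation++;
            Path file = directory.resolve(columnFamilyName + "-" + generation + "-Data.db");
            Files.write(file, buffer);
            buffer.clear();
            // Notify listeners only after the file is fully written. The callback runs
            // on the writing thread, so a slow listener slows down the load itself.
            for (SSTableWriterListener listener : listeners) {
                listener.onSSTableWrittenAndClosed(tableName, columnFamilyName, file.toString());
            }
        }
    }

    /** Copies each finished file to a target directory (a local copy stands in for HDFS). */
    static class CopyingListener implements SSTableWriterListener {
        private final Path target;
        final List<Path> copied = new ArrayList<>();

        CopyingListener(Path target) {
            this.target = target;
        }

        @Override
        public void onSSTableWrittenAndClosed(String tableName, String columnFamilyName, String filename)
                throws IOException {
            Path source = Paths.get(filename);
            Path destination = target.resolve(source.getFileName());
            Files.copy(source, destination, StandardCopyOption.REPLACE_EXISTING);
            copied.add(destination);
        }
    }

    /** Writes the given number of rows and returns how many files the listener copied. */
    static int run(int rows, int maxBufferedRows) throws IOException {
        Path local = Files.createTempDirectory("sstables-local");
        Path remote = Files.createTempDirectory("sstables-remote");
        BufferingWriter writer = new BufferingWriter(local, "Keyspace1", "Standard1", maxBufferedRows);
        CopyingListener listener = new CopyingListener(remote);
        writer.addSSTableWriterListener(listener);
        for (int i = 0; i < rows; i++) {
            writer.addRow("row" + i);
        }
        writer.close();
        return listener.copied.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("files copied: " + run(5, 2));
    }
}
```

Because the callback in this sketch runs synchronously inside the flush, it shows the trade-off raised earlier in the thread: any long-running work in the listener adds directly to the time taken to pump data through the writer. Handing the copy off to a separate thread or queue would be one way around that, at the cost of extra bookkeeping.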
