Thanks Stefan. I gave it a try. Could you or someone else comment on
the code and its performance?
I wrote a fairly ad-hoc dump of the 5900 data files into Jackrabbit.
Storing ~240 MB took roughly 3 minutes. Is this the expected time for
such an operation? Is it possible to improve the performance somehow?
The code I used to persist the data is given below. The pure I/O time
without Jackrabbit is ~1 second on a solid-state disk.
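For reference, the pure-I/O baseline was measured along these lines. This is a self-contained sketch using only the JDK; the temp directory and tiny files here are stand-ins for the real ~5900 JSON files, so only the measurement approach carries over, not the numbers:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class PureIoBaseline {
    public static void main(String[] args) throws IOException {
        // Stand-in for the real data directory (the actual run read
        // ~5900 JSON files totalling ~240 MB).
        Path dir = Files.createTempDirectory("io-baseline");
        for (int i = 0; i < 100; i++) {
            Files.write(dir.resolve("unit" + i + ".json"),
                    ("{\"id\":" + i + "}").getBytes(StandardCharsets.UTF_8));
        }
        long start = System.nanoTime();
        long bytes = 0;
        // Pure read, no JCR involved: just pull every file into memory.
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.json")) {
            for (Path file : files) {
                bytes += Files.readAllBytes(file).length;
            }
        }
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("read " + bytes + " bytes in " + millis + " ms");
    }
}
```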
Thanks for your comments,
Marcel
Mon Sep 26 15:39:05 CEST 2011: 200 units persisted. data 5 MB
Mon Sep 26 15:39:11 CEST 2011: 400 units persisted. data 13 MB
Mon Sep 26 15:39:21 CEST 2011: 600 units persisted. data 21 MB
Mon Sep 26 15:39:31 CEST 2011: 800 units persisted. data 28 MB
Mon Sep 26 15:39:35 CEST 2011: 1000 units persisted. data 33 MB
Mon Sep 26 15:39:40 CEST 2011: 1200 units persisted. data 42 MB
Mon Sep 26 15:39:44 CEST 2011: 1400 units persisted. data 49 MB
Mon Sep 26 15:39:50 CEST 2011: 1600 units persisted. data 57 MB
Mon Sep 26 15:39:54 CEST 2011: 1800 units persisted. data 65 MB
Mon Sep 26 15:39:58 CEST 2011: 2000 units persisted. data 72 MB
Mon Sep 26 15:40:10 CEST 2011: 2200 units persisted. data 88 MB
Mon Sep 26 15:40:15 CEST 2011: 2400 units persisted. data 94 MB
Mon Sep 26 15:40:22 CEST 2011: 2600 units persisted. data 102 MB
Mon Sep 26 15:40:26 CEST 2011: 2800 units persisted. data 107 MB
Mon Sep 26 15:40:30 CEST 2011: 3000 units persisted. data 113 MB
Mon Sep 26 15:40:36 CEST 2011: 3200 units persisted. data 123 MB
Mon Sep 26 15:40:40 CEST 2011: 3400 units persisted. data 129 MB
Mon Sep 26 15:40:45 CEST 2011: 3600 units persisted. data 136 MB
Mon Sep 26 15:40:48 CEST 2011: 3800 units persisted. data 140 MB
Mon Sep 26 15:40:58 CEST 2011: 4000 units persisted. data 143 MB
Mon Sep 26 15:41:18 CEST 2011: 4200 units persisted. data 154 MB
Mon Sep 26 15:41:24 CEST 2011: 4400 units persisted. data 164 MB
Mon Sep 26 15:41:38 CEST 2011: 4600 units persisted. data 185 MB
Mon Sep 26 15:41:43 CEST 2011: 4800 units persisted. data 193 MB
Mon Sep 26 15:41:50 CEST 2011: 5000 units persisted. data 204 MB
Mon Sep 26 15:41:56 CEST 2011: 5200 units persisted. data 211 MB
Mon Sep 26 15:42:00 CEST 2011: 5400 units persisted. data 218 MB
Mon Sep 26 15:42:05 CEST 2011: 5600 units persisted. data 226 MB
Mon Sep 26 15:42:10 CEST 2011: 5800 units persisted. data 235 MB
Mon Sep 26 15:42:15 CEST 2011: 5927 units persisted
import java.io.File;
import java.io.IOException;
import java.util.Date;
import java.util.Iterator;

import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.filefilter.FileFilterUtils;
import org.apache.commons.io.filefilter.TrueFileFilter;
import org.apache.jackrabbit.commons.JcrUtils;
import org.apache.jackrabbit.core.TransientRepository;
import org.apache.jackrabbit.core.config.ConfigurationException;
import org.eclipse.core.runtime.Path;
import org.junit.Before;
import org.junit.Test;

import com.google.common.base.Charsets;
import com.google.common.io.Files;

public class JcrArtifactStoreTest {

    private TransientRepository repository;
    private Session session;

    @Before
    public void setup() throws RepositoryException {
        final File basedir = new File("recommenders/").getAbsoluteFile();
        basedir.mkdir();
        repository = new TransientRepository(basedir);
        session = repository.login(new SimpleCredentials("username",
                "password".toCharArray()));
    }

    @Test
    public void test2() throws ConfigurationException, RepositoryException,
            IOException {
        int i = 0;
        int size = 0;
        final Iterator<File> it = findDataFiles();
        final Node rootNode = session.getRootNode();
        while (it.hasNext()) {
            final File file = it.next();
            // create one node per path segment of the data file
            Node activeNode = rootNode;
            for (final String segment : new Path(file.getAbsolutePath())
                    .segments()) {
                activeNode = JcrUtils.getOrAddNode(activeNode, segment);
            }
            final String content = Files.toString(file, Charsets.UTF_8);
            size += content.getBytes().length;
            activeNode.setProperty("cu", content);
            // persist in batches of 200 units
            if (++i % 200 == 0) {
                session.save();
                System.out.printf("%s: %d units persisted. data %s%n",
                        new Date(), i,
                        FileUtils.byteCountToDisplaySize(size));
            }
        }
        session.save();
        System.out.printf("%s: %d units persisted%n", new Date(), i);
    }

    @SuppressWarnings("unchecked")
    private Iterator<File> findDataFiles() {
        return FileUtils.iterateFiles(new File(
                "/Users/Marcel/Repositories/frankenberger-android-example-apps/"),
                FileFilterUtils.suffixFileFilter(".json"),
                TrueFileFilter.TRUE);
    }
}
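Stefan's suggestion (quoted below) to mirror the Java package hierarchy rather than the file-system path would amount to roughly the following. `toSegments` and `toJcrPath` are hypothetical helpers of mine, shown without the JCR calls so the mapping itself can run standalone:

```java
public class PackagePathMapping {
    // Mirror the Java package hierarchy in the repository: each
    // package segment of the fully-qualified class name becomes
    // one node level.
    static String[] toSegments(String fqcn) {
        return fqcn.split("\\.");
    }

    static String toJcrPath(String fqcn) {
        return "/" + String.join("/", toSegments(fqcn));
    }

    public static void main(String[] args) {
        System.out.println(
                toJcrPath("org.apache.jackrabbit.core.TransientRepository"));
        // In the test above, the segment loop would then become:
        //   for (String segment : toSegments(fqcn)) {
        //       activeNode = JcrUtils.getOrAddNode(activeNode, segment);
        //   }
    }
}
```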
2011/9/26 Stefan Guggisberg <[email protected]>:
> hi marcel,
>
> On Sun, Sep 25, 2011 at 3:40 PM, Marcel Bruch <[email protected]> wrote:
>> Hi,
>>
>> I'm looking for some advice whether Jackrabbit might be a good choice for my
>> problem. Any comments on this are greatly appreciated.
>>
>>
>> = Short description of the challenge =
>>
>> We've built an Eclipse-based tool that analyzes Java source files and stores
>> its analysis results in additional files. The workspace potentially has
>> hundreds of projects, and each project may have up to a few thousand
>> files. Say there are 200 projects and 1000 Java source files per
>> project in a single workspace; then there will be 200 * 1000 = 200,000 files.
>>
>> On a full workspace build, all these 200k files have to be compiled (by the
>> IDE) and analyzed (by our tool) at once, and the analysis results have to be
>> written to disk rather quickly.
>> But the most common use case is that a single file is changed several times
>> per minute and thus gets frequently analyzed.
>>
>> At the moment, the analysis results are dumped to disk as plain JSON files,
>> one JSON file for each Java class. Each JSON file is around 5 to 100 kB in
>> size; some files grow to several megabytes (< 10 MB), and these files have a
>> few hundred complex JSON nodes (which might map perfectly to nodes in JCR).
>>
>> = Question =
>>
>> We would like to replace the simple file-system approach with a more
>> sophisticated one, and I wonder whether Jackrabbit may be a suitable
>> backend for this use case. Since we map all our data to JSON already, it
>> looks like Jackrabbit/JCR is a perfect fit, but I can't say for sure.
>>
>> What's your suggestion? Is Jackrabbit capable of quickly loading and storing
>> JSON-like data, even if 200k files (nodes plus their sub-nodes) have to be
>> updated in a very short time?
>
> absolutely. if the data is reasonably structured/organized, jackrabbit
> should be a perfect fit.
> i suggest leveraging the java package space hierarchy for organizing the data
> (i.e. org.apache.jackrabbit.core.TransientRepository ->
> /org/apache/jackrabbit/core/TransientRepository).
> for further data modeling recommendations see [0].
>
> cheers
> stefan
>
> [0] http://wiki.apache.org/jackrabbit/DavidsModel
>
>>
>>
>> Thanks for your suggestions. If you need more details on what operations
>> are performed or how the data looks, I would be glad to take your questions.
>>
>> Marcel
>>