Thanks Stefan. I gave it a try. Could you or someone else comment on
the code and its performance?
I wrote a fairly ad-hoc dump of the 5900 data files into Jackrabbit.
Storing ~240 MB took roughly 3 minutes. Is this the expected time for
such an operation? Is it possible to improve the performance somehow?
The code I used to persist the data is given below. The pure I/O time
without Jackrabbit is ~1 second on a solid-state disk.
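For reference, the pure-I/O baseline was measured along these lines. This is a self-contained sketch using only the JDK; the temp directory and tiny files here are stand-ins for the real ~5900 JSON files, so only the measurement approach carries over, not the numbers:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class PureIoBaseline {
    public static void main(String[] args) throws IOException {
        // Stand-in for the real data directory (the actual run read
        // ~5900 JSON files totalling ~240 MB).
        Path dir = Files.createTempDirectory("io-baseline");
        for (int i = 0; i < 100; i++) {
            Files.write(dir.resolve("unit" + i + ".json"),
                    ("{\"id\":" + i + "}").getBytes(StandardCharsets.UTF_8));
        }
        long start = System.nanoTime();
        long bytes = 0;
        // Pure read, no JCR involved: just pull every file into memory.
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.json")) {
            for (Path file : files) {
                bytes += Files.readAllBytes(file).length;
            }
        }
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("read " + bytes + " bytes in " + millis + " ms");
    }
}
```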
Thanks for your comments,
Marcel
Mon Sep 26 15:39:05 CEST 2011: 200 units persisted. data 5 MB
Mon Sep 26 15:39:11 CEST 2011: 400 units persisted. data 13 MB
Mon Sep 26 15:39:21 CEST 2011: 600 units persisted. data 21 MB
Mon Sep 26 15:39:31 CEST 2011: 800 units persisted. data 28 MB
Mon Sep 26 15:39:35 CEST 2011: 1000 units persisted. data 33 MB
Mon Sep 26 15:39:40 CEST 2011: 1200 units persisted. data 42 MB
Mon Sep 26 15:39:44 CEST 2011: 1400 units persisted. data 49 MB
Mon Sep 26 15:39:50 CEST 2011: 1600 units persisted. data 57 MB
Mon Sep 26 15:39:54 CEST 2011: 1800 units persisted. data 65 MB
Mon Sep 26 15:39:58 CEST 2011: 2000 units persisted. data 72 MB
Mon Sep 26 15:40:10 CEST 2011: 2200 units persisted. data 88 MB
Mon Sep 26 15:40:15 CEST 2011: 2400 units persisted. data 94 MB
Mon Sep 26 15:40:22 CEST 2011: 2600 units persisted. data 102 MB
Mon Sep 26 15:40:26 CEST 2011: 2800 units persisted. data 107 MB
Mon Sep 26 15:40:30 CEST 2011: 3000 units persisted. data 113 MB
Mon Sep 26 15:40:36 CEST 2011: 3200 units persisted. data 123 MB
Mon Sep 26 15:40:40 CEST 2011: 3400 units persisted. data 129 MB
Mon Sep 26 15:40:45 CEST 2011: 3600 units persisted. data 136 MB
Mon Sep 26 15:40:48 CEST 2011: 3800 units persisted. data 140 MB
Mon Sep 26 15:40:58 CEST 2011: 4000 units persisted. data 143 MB
Mon Sep 26 15:41:18 CEST 2011: 4200 units persisted. data 154 MB
Mon Sep 26 15:41:24 CEST 2011: 4400 units persisted. data 164 MB
Mon Sep 26 15:41:38 CEST 2011: 4600 units persisted. data 185 MB
Mon Sep 26 15:41:43 CEST 2011: 4800 units persisted. data 193 MB
Mon Sep 26 15:41:50 CEST 2011: 5000 units persisted. data 204 MB
Mon Sep 26 15:41:56 CEST 2011: 5200 units persisted. data 211 MB
Mon Sep 26 15:42:00 CEST 2011: 5400 units persisted. data 218 MB
Mon Sep 26 15:42:05 CEST 2011: 5600 units persisted. data 226 MB
Mon Sep 26 15:42:10 CEST 2011: 5800 units persisted. data 235 MB
Mon Sep 26 15:42:15 CEST 2011: 5927 units persisted
import java.io.File;
import java.io.IOException;
import java.util.Date;
import java.util.Iterator;

import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.filefilter.FileFilterUtils;
import org.apache.commons.io.filefilter.TrueFileFilter;
import org.apache.jackrabbit.commons.JcrUtils;
import org.apache.jackrabbit.core.TransientRepository;
import org.apache.jackrabbit.core.config.ConfigurationException;
import org.eclipse.core.runtime.Path;
import org.junit.Before;
import org.junit.Test;

import com.google.common.base.Charsets;
import com.google.common.io.Files;

public class JcrArtifactStoreTest {

    private TransientRepository repository;
    private Session session;

    @Before
    public void setup() throws RepositoryException {
        final File basedir = new File("recommenders/").getAbsoluteFile();
        basedir.mkdir();
        repository = new TransientRepository(basedir);
        session = repository.login(new SimpleCredentials("username",
                "password".toCharArray()));
    }

    @Test
    public void test2() throws ConfigurationException, RepositoryException,
            IOException {
        int i = 0;
        int size = 0;
        final Iterator<File> it = findDataFiles();
        final Node rootNode = session.getRootNode();
        while (it.hasNext()) {
            final File file = it.next();
            // create one node per path segment of the data file
            Node activeNode = rootNode;
            for (final String segment : new Path(file.getAbsolutePath())
                    .segments()) {
                activeNode = JcrUtils.getOrAddNode(activeNode, segment);
            }
            final String content = Files.toString(file, Charsets.UTF_8);
            size += content.getBytes().length;
            activeNode.setProperty("cu", content);
            // persist in batches of 200 units
            if (++i % 200 == 0) {
                session.save();
                System.out.printf("%s: %d units persisted. data %s%n",
                        new Date(), i,
                        FileUtils.byteCountToDisplaySize(size));
            }
        }
        session.save();
        System.out.printf("%s: %d units persisted%n", new Date(), i);
    }

    @SuppressWarnings("unchecked")
    private Iterator<File> findDataFiles() {
        return FileUtils.iterateFiles(new File(
                "/Users/Marcel/Repositories/frankenberger-android-example-apps/"),
                FileFilterUtils.suffixFileFilter(".json"),
                TrueFileFilter.TRUE);
    }
}
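Stefan's suggestion (quoted below) to mirror the Java package hierarchy rather than the file-system path would amount to roughly the following. `toSegments` and `toJcrPath` are hypothetical helpers of mine, shown without the JCR calls so the mapping itself can run standalone:

```java
public class PackagePathMapping {
    // Mirror the Java package hierarchy in the repository: each
    // package segment of the fully-qualified class name becomes
    // one node level.
    static String[] toSegments(String fqcn) {
        return fqcn.split("\\.");
    }

    static String toJcrPath(String fqcn) {
        return "/" + String.join("/", toSegments(fqcn));
    }

    public static void main(String[] args) {
        System.out.println(
                toJcrPath("org.apache.jackrabbit.core.TransientRepository"));
        // In the test above, the segment loop would then become:
        //   for (String segment : toSegments(fqcn)) {
        //       activeNode = JcrUtils.getOrAddNode(activeNode, segment);
        //   }
    }
}
```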
2011/9/26 Stefan Guggisberg <[email protected]>:
> hi marcel,
>
> On Sun, Sep 25, 2011 at 3:40 PM, Marcel Bruch <[email protected]> wrote:
>> Hi,
>>
>> I'm looking for some advice whether Jackrabbit might be a good choice for my
>> problem. Any comments on this are greatly appreciated.
>>
>>
>> = Short description of the challenge =
>>
>> We've built an Eclipse-based tool that analyzes Java source files and stores
>> its analysis results in additional files. The workspace potentially has
>> hundreds of projects, and each project may have up to a few thousand
>> files. Say there are 200 projects and 1000 Java source files per
>> project in a single workspace; then there will be 200 * 1000 = 200,000 files.
>>
>> On a full workspace build, all these 200k files have to be compiled (by the
>> IDE) and analyzed (by our tool) at once, and the analysis results have to be
>> written to disk rather quickly.
>> But the most common use case is that a single file is changed several times
>> per minute and thus gets frequently analyzed.
>>
>> At the moment, the analysis results are dumped to disk as plain JSON files,
>> one JSON file for each Java class. Each JSON file is around 5 to 100 kB in
>> size; some files grow to several megabytes (< 10 MB), and these files have a
>> few hundred complex JSON nodes (which might map perfectly to nodes in JCR).
>>
>> = Question =
>>
>> We would like to replace the simple file-system approach with a more
>> sophisticated one, and I wonder whether Jackrabbit may be a suitable
>> backend for this use case. Since we map all our data to JSON already, it
>> looks like Jackrabbit/JCR is a perfect fit, but I can't say for sure.
>>
>> What's your suggestion? Is Jackrabbit capable of quickly loading and storing
>> JSON-like data, even if 200k files (nodes plus their sub-nodes) have to be
>> updated in a very short time?
>
> absolutely. if the data is reasonably structured/organized, jackrabbit
> should be a perfect fit.
> i suggest leveraging the java package space hierarchy for organizing the data
> (i.e. org.apache.jackrabbit.core.TransientRepository ->
> /org/apache/jackrabbit/core/TransientRepository).
> for further data modeling recommendations see [0].
>
> cheers
> stefan
>
> [0] http://wiki.apache.org/jackrabbit/DavidsModel
>
>>
>>
>> Thanks for your suggestions. If you need more details on what operations
>> are performed or how the data looks, I would be glad to take your questions.
>>
>> Marcel
>>