Re: Spark and Stanford CoreNLP

2014-11-24 Thread Evan Sparks
We have gotten this to work, but it requires instantiating the CoreNLP object on the worker side. Because initialization is slow, it makes a lot of sense to do this inside a .mapPartitions instead of a .map, so that the object is built once per partition rather than once per record. As an aside, if you're using it from Scala, have a look at
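
To illustrate the once-per-partition idea, here is a minimal sketch in plain Python. It is not the CoreNLP setup from this thread: `SlowAnnotator` is a hypothetical stand-in for any object that is expensive to construct, `annotate_partition` is an illustrative name, and the two lists only simulate partitions locally.

```python
from typing import Iterable, Iterator, List


class SlowAnnotator:
    """Hypothetical stand-in for an expensive-to-build object (e.g. an NLP pipeline)."""

    instances_created = 0  # counts constructions so the saving is visible

    def __init__(self) -> None:
        # Imagine several seconds of model loading happening here.
        SlowAnnotator.instances_created += 1

    def annotate(self, text: str) -> List[str]:
        # Placeholder work: whitespace tokenization, not real NLP.
        return text.split()


def annotate_partition(records: Iterable[str]) -> Iterator[List[str]]:
    # Build the annotator once per partition, then reuse it for every record.
    annotator = SlowAnnotator()
    for record in records:
        yield annotator.annotate(record)


# Local simulation of an RDD with two partitions (assumed sample data).
partitions = [["Spark is fast", "CoreNLP is slow to load"], ["one more line"]]
results = [tokens for part in partitions for tokens in annotate_partition(part)]
```

With PySpark, a function of this shape could be passed as `rdd.mapPartitions(annotate_partition)`, so the constructor runs on the worker and once per partition; with `.map` it would run for every record.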

Re: MLlib linking error Mac OS X

2014-10-20 Thread Evan Sparks
MLlib relies on breeze for much of its linear algebra, which in turn relies on netlib-java. netlib-java will attempt to load a native BLAS at runtime and then attempt to load its own precompiled version. Failing that, it will fall back to a Java version that it has built in. The Java

Re: Spark speed performance

2014-10-18 Thread Evan Sparks
How many files do you have and how big is each JSON object? Spark works better with a few big files than with many smaller ones. So you could try cat'ing your files together and rerunning the same experiment. - Evan On Oct 18, 2014, at 12:07 PM, jan.zi...@centrum.cz wrote:

Re: Problem reading from S3 in standalone application

2014-08-06 Thread Evan Sparks
Try s3n:// On Aug 6, 2014, at 12:22 AM, sparkuser2345 hm.spark.u...@gmail.com wrote: I'm still getting the same "Input path does not exist" error after setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and using the format s3://bucket-name/test_data.txt for the

Re: MLLib sample data format

2014-06-22 Thread Evan Sparks
These files follow the libsvm format, where each line is a record, the first column is a label, and the remaining fields are offset:value pairs, where offset is the offset into the feature vector and value is the value of the input feature. This is a fairly efficient representation for sparse
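
As a rough illustration of the record layout described above, here is a minimal Python sketch that parses one line into a label and a sparse mapping. The function name `parse_libsvm_line` and the sample line are made up for the example. The sketch keeps the offsets exactly as written in the file and does no index shifting or validation.

```python
from typing import Dict, Tuple


def parse_libsvm_line(line: str) -> Tuple[float, Dict[int, float]]:
    """Parse one 'label offset:value offset:value ...' record."""
    parts = line.split()
    label = float(parts[0])  # first column is the label
    features: Dict[int, float] = {}
    for field in parts[1:]:
        offset, value = field.split(":")
        # Only non-zero features are listed, hence the sparse dict.
        features[int(offset)] = float(value)
    return label, features


# Assumed sample record: label 1 with three non-zero features.
label, features = parse_libsvm_line("1 3:0.5 10:2.0 42:1.0")
```

Inside Spark itself you would normally not parse these files by hand; MLlib provides `MLUtils.loadLibSVMFile` for that.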