Re: Spark and Stanford CoreNLP

2014-11-24 Thread Evan Sparks
We have gotten this to work, but it requires instantiating the CoreNLP object 
on the worker side. Because of the initialization cost, it makes a lot of 
sense to do this inside of a .mapPartitions instead of a .map, so the pipeline 
is built once per partition rather than once per record. 
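
For concreteness, here is a minimal sketch of that pattern in Scala (the 
textRDD input and the annotator list are illustrative, not something from 
this thread):

import java.util.Properties
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

// Build the (non-serializable) pipeline once per partition, on the worker,
// and reuse it for every record in that partition.
val annotated = textRDD.mapPartitions { lines =>
  val props = new Properties()
  props.setProperty("annotators", "tokenize, ssplit, pos")
  val pipeline = new StanfordCoreNLP(props) // created on the worker, never shipped
  lines.map { line =>
    val doc = new Annotation(line)
    pipeline.annotate(doc)
    doc.toString
  }
}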

As an aside, if you're using it from Scala, have a look at sistanlp, which 
provides a nicer, Scala-friendly interface to CoreNLP. 


 On Nov 24, 2014, at 7:46 AM, tvas theodoros.vasilou...@gmail.com wrote:
 
 Hello,
 
 I was wondering if anyone has gotten the Stanford CoreNLP Java library to
 work with Spark.
 
 My attempts to use the parser/annotator fail with task serialization
 errors, since the class StanfordCoreNLP is not serializable.
 
 I've tried the remedies of registering StanfordCoreNLP through Kryo, as well
 as using chill.MeatLocker, but these still produce serialization errors.
 Marking the StanfordCoreNLP field as @transient leads to a
 NullPointerException instead.
 
 Has anybody managed to get this to work?
 
 Regards,
 Theodore
 
 
 



Re: MLlib linking error Mac OS X

2014-10-20 Thread Evan Sparks
MLlib relies on Breeze for much of its linear algebra, which in turn relies on 
netlib-java. netlib-java will attempt to load a system-native BLAS at runtime, 
and then attempt to load its own precompiled version. Failing that, it will 
fall back to a pure-Java version that it has built in. The Java version can be 
about as fast as a native version for certain operations that are tricky to 
optimize (like vector dot products), but MUCH slower for things like 
matrix/matrix multiply. Luckily, the code will still work without the native 
libraries installed; it will just be slower in some situations. So you can 
safely ignore the warnings if all you care about is correctness. 

The MLlib docs (https://spark.apache.org/docs/latest/mllib-guide.html) provide 
guidance about how to link against the native libraries in your application - 
this will make the warning messages go away and might speed up your program.
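
For sbt users, a sketch of the dependency the netlib-java README suggests for 
pulling in its precompiled native implementations (check the MLlib guide for 
the coordinates that match your Spark version):

// build.sbt fragment: adds netlib-java's native system/reference builds
libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()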

- Evan

 On Oct 20, 2014, at 3:54 AM, npomfret nick-nab...@snowmonkey.co.uk wrote:
 
 I'm getting the same warning on my Mac, accompanied by what appears to be
 pretty low CPU usage
 (http://apache-spark-user-list.1001560.n3.nabble.com/mlib-model-build-and-low-CPU-usage-td16777.html);
 I wonder if the two are connected?
 
 I've used jblas on a Mac several times, and it always just works with zero
 setup. Maybe the warning is misleading.
 
 
 



Re: Spark speed performance

2014-10-18 Thread Evan Sparks
How many files do you have and how big is each JSON object?

Spark works better with a few big files than with many smaller ones, so you 
could try cat'ing your files together and rerunning the same experiment. 
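
If concatenating the files offline is inconvenient, here is a rough Scala 
sketch of the same idea inside Spark: read the whole directory in one 
textFile call and coalesce away the per-file partitions (paths and the 
partition count are placeholders):

// One textFile call over a glob still yields roughly one partition per
// small file, so coalesce them down to cut per-task scheduling overhead.
val raw = sc.textFile("s3n://my-bucket/json-input/*")
val compacted = raw.coalesce(32)
compacted.saveAsTextFile("s3n://my-bucket/json-output")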

- Evan


 On Oct 18, 2014, at 12:07 PM, jan.zi...@centrum.cz wrote:
 
 Hi,
 
 I have a program for a single computer (in Python), and I have implemented 
 the same thing for Spark. The program basically just reads .json input, 
 takes one field from it, and saves the result back. Using Spark, my program 
 runs approximately 100 times slower on 1 master and 1 slave, so I would like 
 to ask where the problem might be.
 
 My Spark program looks like:
 
 sc = SparkContext(appName="Json data preprocessor")
 distData = sc.textFile(sys.argv[2])
 json_extractor = JsonExtractor(sys.argv[1])
 cleanedData = distData.flatMap(json_extractor.extract_json)
 cleanedData.saveAsTextFile(sys.argv[3])
 
 JsonExtractor just selects the data from the field given by sys.argv[1].
 
 My data is basically many small JSON files, with one JSON object per line.
 
 I have tried both reading and writing the data from/to Amazon S3 and
 from/to local disk on all the machines.
 
 I would like to ask whether there is something I am missing, or whether
 Spark is really supposed to be this slow in comparison with a local,
 non-parallelized single-node program.
  
 Thank you in advance for any suggestions or hints.
 


Re: Problem reading from S3 in standalone application

2014-08-06 Thread Evan Sparks
Try s3n://
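
That is, point textFile at an s3n:// URL and hand the credentials to Hadoop 
explicitly, since the s3n filesystem typically reads the fs.s3n.* properties 
rather than the AWS_* environment variables (the bucket name and env lookup 
here are placeholders):

// Scala sketch; all names are placeholders.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
val data = sc.textFile("s3n://bucket-name/test_data.txt")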

 On Aug 6, 2014, at 12:22 AM, sparkuser2345 hm.spark.u...@gmail.com wrote:
 
 I'm getting the same "Input path does not exist" error even after setting
 the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and
 using the format s3://bucket-name/test_data.txt for the input file.
 
 
 



Re: MLLib sample data format

2014-06-22 Thread Evan Sparks
These files follow the libsvm format: each line is one record, the first 
column is the label, and the remaining fields are offset:value pairs, where 
offset is the (1-based) offset into the feature vector and value is the value 
of that input feature. 

This is a fairly efficient representation for sparse data, but it can double 
(or more) the storage requirements for dense data. 
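
For illustration, two made-up records in this format, and the MLlib loader 
that parses it (the path is the sample file shipped in the Spark repo):

import org.apache.spark.mllib.util.MLUtils

// Illustrative records:
//   1 1:0.5 3:1.7   -> label 1.0, feature 1 = 0.5, feature 3 = 1.7
//   0 2:0.9         -> label 0.0, feature 2 = 0.9
// libsvm offsets are 1-based; loadLibSVMFile converts them to 0-based.
val examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")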

- Evan

 On Jun 22, 2014, at 3:35 PM, Justin Yip yipjus...@gmail.com wrote:
 
 Hello,
 
 I am looking into a couple of the MLlib data files in
 https://github.com/apache/spark/tree/master/data/mllib, but I cannot find
 any explanation for these files. Does anyone know if they are documented?
 
 Thanks.
 
 Justin