Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-24 Thread Timur Shenkao
Hello, everybody! Maybe it's not the cause of your problem, but I've noticed this line in your comments: *java version "1.8.0_51"* It's strongly advised to use Java 1.8.0_66 or later. I use Java 1.8.0_101. On Tue, Sep 20, 2016 at 1:09 AM, janardhan shetty wrote: > Yes

Re: ideas on de duplication for spark streaming?

2016-09-24 Thread Jörn Franke
As Cody said, Spark is not going to help you here. There are two issues you need to look at: messages duplicated (or more) and processed by two different processes, and the failure of any component (including the message broker). Keep in mind that duplicated messages can even

Spark 1.6.2 Concurrent append to a HDFS folder with different partition key

2016-09-24 Thread Shing Hing Man
I am trying to prototype using a single SqlContext instance and using it to append DataFrames, partitioned by a field, to the same HDFS folder from multiple threads. (Each thread works with a DataFrame having a different partition column value.) I get the exception: 16/09/24 16:45:12 ERROR
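
The intended layout can be sketched without Spark at all. Below is a minimal stand-in using plain files and threads, assuming a Hive-style `column=value` partition directory layout and one new part file per append; it only illustrates why writers targeting *different* partition directories need not collide on files (the actual Spark failure typically involves the shared `_temporary` staging directory, which this sketch does not model).

```python
import os
import tempfile
import threading
import uuid

def append_partition(base, column, value, rows):
    """Append rows as a new part file inside a Hive-style partition directory."""
    part_dir = os.path.join(base, f"{column}={value}")
    os.makedirs(part_dir, exist_ok=True)
    # unique file name per append, so concurrent appends never clash on a file
    part_file = os.path.join(part_dir, f"part-{uuid.uuid4().hex}")
    with open(part_file, "w") as f:
        f.writelines(line + "\n" for line in rows)

base = tempfile.mkdtemp()
threads = [
    threading.Thread(target=append_partition,
                     args=(base, "date", d, [f"row-{d}-{i}" for i in range(3)]))
    for d in ("2016-09-23", "2016-09-24")
]
for t in threads: t.start()
for t in threads: t.join()

partitions = sorted(os.listdir(base))
print(partitions)  # → ['date=2016-09-23', 'date=2016-09-24']
```

A common workaround for the Spark case follows the same idea: have each thread write directly to its own partition path instead of sharing one append to the base folder.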

Re: ideas on de duplication for spark streaming?

2016-09-24 Thread Cody Koeninger
Spark alone isn't going to solve this problem, because you have no reliable way of making sure a given worker has a consistent shard of the messages seen so far, especially if there's an arbitrary amount of delay between duplicate messages. You need a DHT or something equivalent. On Sep 24, 2016
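
The "DHT or something equivalent" suggestion can be sketched as dedup against a shared store keyed by message id. This is a minimal sketch, assuming each message carries a unique id and using an in-process dict with a TTL as a stand-in for a real shared store (Redis, Cassandra, a DHT); the class and method names are hypothetical.

```python
import time

class SeenStore:
    """Stand-in for a shared store keyed by message id, with a TTL window."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.seen = {}  # message id -> expiry timestamp

    def first_time(self, msg_id, now=None):
        """Return True only the first time msg_id is seen within the TTL window."""
        now = time.time() if now is None else now
        # purge expired ids so the store does not grow without bound
        self.seen = {k: v for k, v in self.seen.items() if v > now}
        if msg_id in self.seen:
            return False
        self.seen[msg_id] = now + self.ttl
        return True

store = SeenStore(ttl_seconds=3600)
batch = ["a", "b", "a", "c", "b"]   # at-least-once delivery: duplicates possible
unique = [m for m in batch if store.first_time(m)]
print(unique)  # → ['a', 'b', 'c']
```

The TTL is the key design choice: as the thread notes, with an arbitrary delay between duplicates no finite window is fully safe, so the TTL bounds memory at the cost of letting very late duplicates through.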

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-24 Thread Yash Sharma
We have too many (large) files. We have about 30k partitions with about 4 years' worth of data, and we need to process the entire history in a one-time monolithic job. I would like to know how Spark decides the number of executors requested. I've seen testcases where the max executors count is Integer's
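
The arithmetic behind a request like "168510 executors" can be sketched as follows. This is an illustration under stated assumptions, not Spark's actual implementation: with dynamic allocation and no cap, the driver asks for enough executors to run every pending task at once (one task per input split), and the 674,040-task figure below is a hypothetical number chosen only to show how such a total could arise.

```python
import math

def requested_executors(num_tasks, cores_per_executor):
    """Rough estimate: enough executors to run all pending tasks concurrently,
    assuming one task per core and no spark.dynamicAllocation.maxExecutors cap."""
    return math.ceil(num_tasks / cores_per_executor)

# one task per partition over ~30k partitions, 4 cores per executor
print(requested_executors(30_000, cores_per_executor=4))   # → 7500

# a hypothetical job with ~674k small input splits produces the number
# from the subject line
print(requested_executors(674_040, cores_per_executor=4))  # → 168510
```

The practical fix discussed in such threads is to cap the request (`spark.dynamicAllocation.maxExecutors`) and/or coalesce small input splits so the task count shrinks.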

Re: databricks spark-csv: linking coordinates are what?

2016-09-24 Thread Anastasios Zouzias
Hi Dan, If you use Spark <= 1.6, you can also do $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.5.0 to quickly link the spark-csv jars into the Spark shell. Otherwise, as Holden suggested, you link it via your Maven/sbt dependencies. Spark guys assume that their users have a good

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-24 Thread ayan guha
Do you have too many small files you are trying to read? The number of executors is very high. On 24 Sep 2016 10:28, "Yash Sharma" wrote: > Have been playing around with configs to crack this. Adding them here > where it would be helpful to others :) > Number of executors and

Re: Spark MLlib ALS algorithm

2016-09-24 Thread Nick Pentreath
The scale factor was only to scale up the number of ratings in the dataset for performance testing purposes, to illustrate the scalability of Spark ALS. It is not something you would normally do on your training dataset. On Fri, 23 Sep 2016 at 20:07, Roshani Nagmote
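
What such a scale factor does can be sketched in a few lines. This is a hedged illustration of the general technique (the function name and offset scheme are assumptions, not the benchmark's actual code): ratings are replicated with shifted user ids so each copy looks like a fresh block of users, inflating dataset size for performance testing while adding no new training signal.

```python
def scale_ratings(ratings, scale_factor, num_users):
    """Replicate (user, item, rating) triples `scale_factor` times, offsetting
    user ids per copy. For benchmarking scalability only -- the copies carry
    no new information, so this should never be done to a real training set."""
    return [(user + k * num_users, item, rating)
            for k in range(scale_factor)
            for (user, item, rating) in ratings]

ratings = [(0, 10, 4.0), (1, 11, 5.0)]
scaled = scale_ratings(ratings, scale_factor=2, num_users=2)
print(len(scaled))  # → 4
print(scaled[2])    # → (2, 10, 4.0)
```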

ideas on de duplication for spark streaming?

2016-09-24 Thread kant kodali
Hi Guys, I have a bunch of data coming into my Spark Streaming cluster from a message queue (not Kafka). This message queue guarantees at-least-once delivery only, so there is potential that some of the messages coming into the Spark Streaming cluster are actually duplicates, and I am trying