Re: Is RankingMetrics' NDCG implementation correct?

2016-09-19 Thread Jong Wook Kim
don't think it's valid to use it as such.

> On Mon, Sep 19, 2016 at 4:42 AM, Jong Wook Kim wrote:
>
> Hi,
>
> I'm trying to evaluate a recommendation model, and found that Spark and
> Rival give different results, and it seems that Rival's

Is RankingMetrics' NDCG implementation correct?

2016-09-18 Thread Jong Wook Kim
Hi, I'm trying to evaluate a recommendation model, and found that Spark and Rival give different results; it seems that Rival's is what Kaggle defines: https://gist.github.com/jongw
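For reference, here is a minimal sketch of NDCG as commonly defined (the Kaggle-style formulation mentioned above). The exact discount and relevance handling are where implementations tend to diverge; the function names below are illustrative and come from neither Spark nor Rival:

```python
import math

def dcg(relevances):
    # Discounted cumulative gain with the standard log2(i + 2) discount
    # (the item at rank 1, index 0, gets discount log2(2) = 1)
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances_in_ranked_order, k=None):
    # NDCG of a ranking, given each ranked item's relevance score:
    # DCG of the actual order divided by DCG of the ideal (sorted) order
    rels = relevances_in_ranked_order[:k] if k is not None else relevances_in_ranked_order
    ideal = sorted(relevances_in_ranked_order, reverse=True)
    ideal = ideal[:k] if k is not None else ideal
    ideal_dcg = dcg(ideal)
    return dcg(rels) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A perfectly ordered ranking scores 1.0; implementations that use binary relevance or a different discount base will disagree with this on graded-relevance inputs, which is one plausible source of the Spark-vs-Rival discrepancy.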

Re: AVRO vs Parquet

2016-03-03 Thread Jong Wook Kim
How about ORC? I have experimented briefly with Parquet and ORC, and I liked the fact that ORC has its schema within the file, which makes it handy to work with other tools.

Jong Wook

On 3 March 2016 at 23:29, Don Drake wrote:
> My tests show Parquet has better performance than Avro in just

Spark-shell connecting to Mesos stuck at sched.cpp

2015-11-15 Thread Jong Wook Kim
I'm having a problem connecting my Spark app to a Mesos cluster; any help on the question below would be appreciated. http://stackoverflow.com/questions/33727154/spark-shell-connecting-to-mesos-stuck-at-sched-cpp Thanks, Jong Wook

Spark YARN Shuffle service wire compatibility

2015-10-22 Thread Jong Wook Kim
Hi, I’d like to know whether the Spark YARN shuffle service guarantees wire compatibility between 1.x versions. I could run a Spark 1.5 job with YARN NodeManagers running shuffle service 1.4, but it might’ve been just a coincidence. Now we’re upgrading CDH from 5.3 to 5.4, whose NodeManage
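For context, the external shuffle service in question is the one wired into each NodeManager via yarn-site.xml, roughly like this (property names as documented in the Spark-on-YARN docs; the exact value list depends on which aux services your cluster already runs):

```xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```

Because the service runs inside the NodeManager JVM rather than the application, its jar version is fixed cluster-wide, which is why the cross-version question above matters.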

Re: How to maintain multiple JavaRDD created within another method like javaStreamRDD.forEachRDD

2015-07-14 Thread Jong Wook Kim
Your question is not very clear, but from what I understand, you want to deal with a stream of MyTable containing records parsed from your Kafka topics. What you need is JavaDStream, and you can use transform()
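As an illustration of the per-batch parsing step that transform() (or map()) would apply, here is a minimal pure-Python sketch; the tab-separated payload format, field names, and function names are hypothetical stand-ins for the actual Kafka message format:

```python
def parse_record(line):
    # Hypothetical payload format: tab-separated "id<TAB>name<TAB>value";
    # substitute your actual Kafka message parsing here
    record_id, name, value = line.split("\t")
    return {"id": int(record_id), "name": name, "value": float(value)}

def parse_batch(lines):
    # The function applied to each batch of raw Kafka payloads:
    # raw strings in, parsed MyTable-like records out
    return [parse_record(line) for line in lines]
```

In the streaming job, this per-batch function is what you would hand to the DStream so that downstream operators see parsed records instead of raw strings.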

Re: ProcessBuilder in SparkLauncher is memory inefficient for launching new process

2015-07-14 Thread Jong Wook Kim
The article you've linked is specific to an embedded system; the JVM built for that architecture (which the author didn't mention) might not be as stable and well-supported as HotSpot. ProcessBuilder is a stable Java API, and despite somewhat limited functionality it is the standard method to l

Re: spark on yarn

2015-07-14 Thread Jong Wook Kim
It's probably because your YARN cluster has only 40 vCores available. Go to your ResourceManager and check whether "VCores Total" and "Memory Total" exceed what you have set (40 cores and 5120 MB). If that looks fine, go to the "Scheduler" page, find the queue on which your jobs run, and check the
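As a sanity check, the arithmetic the scheduler effectively does can be sketched as below; the 40-vCore / 5120 MB figures are the hypothetical numbers from the message, and the function name is illustrative:

```python
def fits_in_cluster(num_executors, cores_per_executor, mem_per_executor_mb,
                    total_vcores, total_memory_mb):
    # An application only gets all its executors if both the total core
    # request and the total memory request fit within the cluster's capacity
    return (num_executors * cores_per_executor <= total_vcores and
            num_executors * mem_per_executor_mb <= total_memory_mb)
```

If the request doesn't fit, YARN simply grants fewer containers than asked for, which looks from the Spark side like executors never starting.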

Re: About extra memory on yarn mode

2015-07-14 Thread Jong Wook Kim
spark.executor.memory only sets the maximum heap size of the executor; the JVM also needs non-heap memory to store class metadata, interned strings, and other native overheads coming from networking libraries, off-heap storage levels, etc. These are (of course) legitimate uses of resources, and you'll have t
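Concretely, the YARN container request is the heap plus a per-executor overhead. A sketch assuming the Spark 1.x defaults for spark.yarn.executor.memoryOverhead (minimum 384 MB, otherwise 7% of executor memory; verify against your version's docs, as later versions changed the property name and factor):

```python
def yarn_container_size_mb(executor_memory_mb, overhead_mb=None):
    # spark.yarn.executor.memoryOverhead defaults to max(384, 0.07 * heap)
    # in Spark 1.x (assumption: check the docs for your exact version)
    if overhead_mb is None:
        overhead_mb = max(384, int(0.07 * executor_memory_mb))
    return executor_memory_mb + overhead_mb
```

This is why a 4 GB executor actually occupies more than 4 GB of the cluster's "Memory Total" on the ResourceManager page.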

Re: RECEIVED SIGNAL 15: SIGTERM

2015-07-12 Thread Jong Wook Kim
Based on my experience, YARN containers can get SIGTERM when:

- they produce too many logs and use up the hard drive
- they use more off-heap memory than what is given by the spark.yarn.executor.memoryOverhead configuration. It might be due to too many classes loaded (less than MaxPermGen but more tha
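If the off-heap usage is the cause, the usual fix is to raise the overhead allowance explicitly when submitting, for example (the 1024 MB figure is an arbitrary illustration, and the property name is the Spark 1.x one):

```shell
# Raise the off-heap allowance for each YARN container (Spark 1.x property)
spark-submit \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  --executor-memory 4g \
  ...
```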

Re: Streaming checkpoints and logic change

2015-07-08 Thread Jong Wook Kim
> getLatest() will return the latest set of filters that is broadcasted out,
> and as the transform function is processed in every batch interval, it will
> always use the latest filters.
>
> HTH.
>
> TD
>
> On Wed, Jul 8, 2015 at 10:02 AM, Jong Wook Kim wrote:
>>

Streaming checkpoints and logic change

2015-07-08 Thread Jong Wook Kim
I just asked this question at the streaming webinar that just ended, but the speakers didn't answer, so I'm throwing it out here: AFAIK checkpoints are the only recommended method for running Spark Streaming without data loss. But it involves serializing the entire DStream graph, which prohibits any logic c

Re: saveAsTextFile of RDD[Array[Any]]

2015-02-09 Thread Jong Wook Kim
If you have `RDD[Array[Any]]` you can do rdd.map(_.mkString("\t")) (or use some other delimiter) to make it `RDD[String]`, and then call `saveAsTextFile`. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-of-RDD-Array-Any-tp21548p21554.html Se
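The same idea in Python, for reference: Scala's mkString corresponds to str.join, with each field stringified first (the tab delimiter is just the example from above):

```python
def to_tsv_line(row):
    # Equivalent of Scala's _.mkString("\t"): stringify each field, join with tabs
    return "\t".join(str(field) for field in row)
```

Mapping this over the collection turns each heterogeneous row into one delimited text line, ready for a plain-text save.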

Re: Custom streaming receiver slow on YARN

2015-02-09 Thread Jong Wook Kim
Replying to my own thread: I realized that this only happens when the replication level is 1. Regardless of whether the storage level is memory-only, on disk, or deserialized, I had to make the replication level >= 2 to make the streaming work properly on YARN. I still don't get why, because intuitively less

Custom streaming receiver slow on YARN

2015-02-07 Thread Jong Wook Kim
Hello, I have an issue where my streaming receiver is laggy on YARN. Can anyone reply to my question on Stack Overflow? http://stackoverflow.com/questions/28370362/spark-streaming-receiver-particularly-slow-on-yarn Thanks, Jong Wook -- View this message in context: http://apache-spark-u