Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-14 Thread Gene Pang
Hi Dmitry, I am not familiar with all of the details you have just described, but I think Tachyon should be able to help you. If you store all of your resource files in HDFS or S3 or both, you can run Tachyon to use those storage systems as the under storage (
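
A minimal sketch of what reading such a shared resource could look like from spark-shell, assuming a Tachyon master at localhost:19998 with HDFS/S3 as its under storage and the Tachyon client jar on the classpath (host, port and path are illustrative, not from the thread):

    // Read a resource file that Tachyon serves from its under storage,
    // letting Tachyon cache it in cluster memory on first access.
    val resource = sc.textFile("tachyon://localhost:19998/resources/lexicon.txt")

    // Use it like any other RDD, e.g. collect a small lookup set to the driver.
    val lookup = resource.collect().toSet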

Re: Spark and HBase RDD join/get

2016-01-14 Thread Ted Yu
For #1, yes it is possible. You can find some examples in the hbase-spark module of HBase, where HBase as a DataSource is provided, e.g. https://github.com/apache/hbase/blob/master/hbase-spark/src/main/scala/org/apache/hadoop/hbase/spark/HBaseRDDFunctions.scala Cheers On Thu, Jan 14, 2016 at 5:04 AM,

RE: Random Forest FeatureImportance throwing NullPointerException

2016-01-14 Thread Rachana Srivastava
Tried using the 1.6 version of Spark, which takes numberOfFeatures as a fifth argument in the API, but still getting featureImportance as null. RandomForestClassifier rfc = getRandomForestClassifier( numTrees, maxBinSize, maxTreeDepth, seed, impurity); RandomForestClassificationModel rfm =

Re: SparkContext SyntaxError: invalid syntax

2016-01-14 Thread Andrew Weiner
Hi Bryan, I ran "$> python --version" on every node on the cluster, and it is Python 2.7.8 for every single one. When I try to submit the Python example in client mode * ./bin/spark-submit --master yarn --deploy-mode client --driver-memory 4g --executor-memory 2g

Re: [discuss] dropping Hadoop 2.2 and 2.3 support in Spark 2.0?

2016-01-14 Thread Sean Owen
I personally support this. I had suggested drawing the line at Hadoop 2.6, but that's minor. More info: Hadoop 2.7: April 2015 Hadoop 2.6: Nov 2014 Hadoop 2.5: Aug 2014 Hadoop 2.4: April 2014 Hadoop 2.3: Feb 2014 Hadoop 2.2: Oct 2013 CDH 5.0/5.1 = Hadoop 2.3 + backports CDH 5.2/5.3 = Hadoop 2.5 +

Re: Read Accumulator value while running

2016-01-14 Thread Mennour Rostom
Hi Daniel, Andrew, Thank you for your answers. So it's not possible to read the accumulator value until the action that manipulates it finishes; that's a pity, I'll think of something else. However, the most important thing in my application is the ability to launch 2 (or more) actions in parallel and

Spark and HBase RDD join/get

2016-01-14 Thread Kristoffer Sjögren
Hi, We have an RDD that needs to be mapped with information from HBase, where the exact key is the user id. What are the different alternatives for doing this? - Is it possible to do HBase.get() requests from a map function in Spark? - Or should we join RDDs with a full HBase table scan? I ask
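
For option #1, one common pattern is to open a connection per partition and issue gets from mapPartitions. A sketch using the plain HBase 1.x client rather than the hbase-spark module; `userIds` stands for the RDD[String] of user ids from the question, and the table name, column family and qualifier are illustrative:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
    import org.apache.hadoop.hbase.util.Bytes

    val enriched = userIds.mapPartitions { ids =>
      // One connection/table per partition, not per record.
      val conf = HBaseConfiguration.create()
      val conn = ConnectionFactory.createConnection(conf)
      val table = conn.getTable(TableName.valueOf("users"))
      val out = ids.map { id =>
        val result = table.get(new Get(Bytes.toBytes(id)))
        val name = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name")))
        (id, name)
      }.toList                      // materialize before closing the connection
      table.close(); conn.close()
      out.iterator
    }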

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-14 Thread Dmitry Goldenberg
OK so it looks like Tachyon is a cluster memory plugin marked as "experimental" in Spark. In any case, we've got a few requirements for the system we're working on which may drive the decision for how to implement large resource file management. The system is a framework of N data analyzers

Spark SQL . How to enlarge output rows ?

2016-01-14 Thread Eli Super
Hi After executing sql sqlContext.sql("select day_time from my_table limit 10").show() my output looks like : ++ | day_time| ++ |2015/12/15 15:52:...| |2015/12/15 15:53:...| |2015/12/15 15:52:...| |2015/12/15 15:52:...| |2015/12/15 15:52:...|

NPE when using Joda DateTime

2016-01-14 Thread Spencer, Alex (Santander)
Hello, I was wondering if somebody is able to help me get to the bottom of a null pointer exception I'm seeing in my code. I've managed to narrow down a problem in a larger class to my use of Joda's DateTime functions. I've successfully run my code in scala, but I've hit a few problems when

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-14 Thread Dmitry Goldenberg
The other thing from some folks' recommendations on this list was Apache Ignite. Their In-Memory File System ( https://ignite.apache.org/features/igfs.html) looks quite interesting. On Thu, Jan 14, 2016 at 7:54 AM, Dmitry Goldenberg wrote: > OK so it looks like

code hangs in local master mode

2016-01-14 Thread Kai Wei
Hi list, I ran into an issue which I think could be a bug. I have a Hive table stored as Parquet files. Let's say it's called testtable. I found the code below gets stuck forever in spark-shell with a local master or driver/executor: sqlContext.sql("select * from

RE: NPE when using Joda DateTime

2016-01-14 Thread Spencer, Alex (Santander)
Hi, I tried take(1500) and test.collect, and these both work on the "single" map statement. I'm very new to Kryo serialisation; I managed to find some code, copied and pasted it, and that's what originally made the single map statement work: class MyRegistrator extends KryoRegistrator {

RE: Spark SQL . How to enlarge output rows ?

2016-01-14 Thread Spencer, Alex (Santander)
Hi, Try …..show(false) public void show(int numRows, boolean truncate) Kind Regards, Alex. From: Eli Super [mailto:eli.su...@gmail.com] Sent: 14 January 2016 13:09 To: user@spark.apache.org Subject: Spark SQL . How to enlarge output rows ? Hi After executing sql
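
In other words, a spark-shell sketch (the table and column name are taken from the original question):

    // show(false) disables the default truncation of long cell values
    sqlContext.sql("select day_time from my_table limit 10").show(false)

    // or control both the row count and truncation explicitly
    sqlContext.sql("select day_time from my_table").show(10, false)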

Re: NPE when using Joda DateTime

2016-01-14 Thread Sean Owen
It does look somehow like the state of the DateTime object isn't being recreated properly on deserialization, given where the NPE occurs (look at the Joda source code). However, the object is java.io.Serializable. Are you sure the Kryo serialization is correct? It doesn't quite explain why

Re: Spark and HBase RDD join/get

2016-01-14 Thread Kristoffer Sjögren
Thanks Ted! On Thu, Jan 14, 2016 at 4:49 PM, Ted Yu wrote: > For #1, yes it is possible. > > You can find some example in hbase-spark module of hbase where hbase as > DataSource is provided. > e.g. > >

Re: NPE when using Joda DateTime

2016-01-14 Thread Sean Owen
That's right, though it's possible the default way Kryo chooses to serialize the object doesn't work. I'd debug a little more and print out as much as you can about the DateTime object at the point it appears to not work. I think there's a real problem and it only happens to not turn up for the

RE: NPE when using Joda DateTime

2016-01-14 Thread Spencer, Alex (Santander)
I appreciate this – thank you. I’m not an admin on the box I’m using spark-shell on – so I’m not sure I can add them to that namespace. I’m hoping if I declare the JodaDateTimeSerializer class in my REPL that I can still get this to work. I think the INTERVAL part below may be key, I haven’t

Re: DataFrameWriter on partitionBy for parquet eat all RAM

2016-01-14 Thread Jerry Lam
Hi Arkadiusz, partitionBy is not designed to handle many distinct values, the last time I used it. If you search the mailing list, I think there are a couple of people who also faced similar issues. For example, in my case, it won't work over a million distinct user ids. It will require a lot of

Re: Spark Streaming: custom actor receiver losing vast majority of data

2016-01-14 Thread Shixiong(Ryan) Zhu
Could you change MEMORY_ONLY_SER to MEMORY_AND_DISK_SER_2 and see if this still happens? It may be because you don't have enough memory to cache the events. On Thu, Jan 14, 2016 at 4:06 PM, Lin Zhao wrote: > Hi, > > I'm testing spark streaming with actor receiver. The actor

Re: Spark Streaming: custom actor receiver losing vast majority of data

2016-01-14 Thread Lin Zhao
Hi Shixiong, I tried this but it still happens. If it helps, it's 1.6.0 and runs on YARN. Batch duration is 20 seconds. Some logs seemingly related to block manager: 16/01/15 00:31:25 INFO receiver.BlockGenerator: Pushed block input-0-1452817873000 16/01/15 00:31:27 INFO storage.MemoryStore:

Understanding Spark Rebalancing

2016-01-14 Thread Pedro Rodriguez
Hi All, I am running a Spark program where one of my parts is using Spark as a scheduler rather than a data management framework. That is, my job can be described as RDD[String] where the string describes an operation to perform which may be cheap or expensive (process an object which may have a

Re: Spark Streaming: custom actor receiver losing vast majority of data

2016-01-14 Thread Lin Zhao
Hi Shixiong, Just figured it out. I was doing a .print() as the output operation, which seems to stop the batch once 10 records have gone through. I changed it to a no-op foreachRDD and it works. Thanks for jumping in to help me. From: "Shixiong(Ryan) Zhu"
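
For reference, DStream.print() only shows the first elements of each batch (10 by default), so it does not force the whole RDD to be computed; an output operation that touches every element does. A sketch (the stream name is illustrative):

    // print() only evaluates enough partitions to display the first 10 records
    stream.print()

    // force every record of every batch to be processed
    stream.foreachRDD { rdd =>
      rdd.foreach(_ => ())   // no-op action that still evaluates all partitions
    }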

Re: Random Forest FeatureImportance throwing NullPointerException

2016-01-14 Thread Bryan Cutler
If you are able to just train the RandomForestClassificationModel from ML directly instead of training the old model and converting, then that would be the way to go. On Thu, Jan 14, 2016 at 2:21 PM, wrote: > Thanks so much Bryan for your response. Is
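
A sketch of that approach with the spark.ml API, reusing the parameter names from the original post and assuming a DataFrame `training` that already has "label" and "features" columns:

    import org.apache.spark.ml.classification.RandomForestClassifier

    val rf = new RandomForestClassifier()
      .setNumTrees(numTrees)
      .setMaxBins(maxBinSize)
      .setMaxDepth(maxTreeDepth)
      .setSeed(seed)
      .setImpurity(impurity)

    val model = rf.fit(training)            // RandomForestClassificationModel
    println(model.featureImportances)       // populated, unlike the converted MLlib model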

How to bind webui to localhost?

2016-01-14 Thread Zee Chen
Hi, what is the easiest way to configure the Spark webui to bind to localhost or 127.0.0.1? I intend to use this with ssh socks proxy to provide a rudimentary "secured access". Unlike hadoop config options, Spark doesn't allow the user to directly specify the ip addr to bind services to.

Re: How to bind webui to localhost?

2016-01-14 Thread Shixiong(Ryan) Zhu
Yeah, it's hard-coded as "0.0.0.0". Could you send a PR to add a configuration for it? On Thu, Jan 14, 2016 at 2:51 PM, Zee Chen wrote: > Hi, what is the easiest way to configure the Spark webui to bind to > localhost or 127.0.0.1? I intend to use this with ssh socks proxy to >

Re: Sending large objects to specific RDDs

2016-01-14 Thread Daniel Imberman
Hi Ted, So unfortunately after looking into the cluster manager that I will be using for my testing (I'm using a super-computer called XSEDE rather than AWS), it looks like the cluster does not actually come with Hbase installed (this cluster is becoming somewhat problematic, as it is essentially

Spark Streaming: custom actor receiver losing vast majority of data

2016-01-14 Thread Lin Zhao
Hi, I'm testing spark streaming with an actor receiver. The actor keeps calling store() to save a pair to Spark. Once the job is launched, everything looks good on the UI. Millions of events get through every batch. However, I added logging to the first step and found that only 20 or 40 events

Re: How to bind webui to localhost?

2016-01-14 Thread Zee Chen
sure will do. On Thu, Jan 14, 2016 at 3:19 PM, Shixiong(Ryan) Zhu wrote: > Yeah, it's hard code as "0.0.0.0". Could you send a PR to add a > configuration for it? > > On Thu, Jan 14, 2016 at 2:51 PM, Zee Chen wrote: >> >> Hi, what is the easiest way to

Re: Spark Streaming: custom actor receiver losing vast majority of data

2016-01-14 Thread Shixiong(Ryan) Zhu
Could you post the code of MessageRetriever? And by the way, could you post a screenshot of the tasks for a batch and check the input size of these tasks? Considering there are so many events, there should be a lot of blocks as well as a lot of tasks. On Thu, Jan 14, 2016 at 4:34 PM, Lin Zhao

Re: Random Forest FeatureImportance throwing NullPointerException

2016-01-14 Thread Bryan Cutler
Hi Rachana, I got the same exception. It is because computing the feature importance depends on impurity stats, which are not calculated with the old RandomForestModel in MLlib. Feel free to create a JIRA for this if you think it is necessary; otherwise I believe this problem will be eventually

Using JDBC clients with "Spark on Hive"

2016-01-14 Thread sdevashis
Hello Experts, I am getting started with Hive with Spark as the query engine. I built the package from sources. I am able to invoke the Hive CLI and run queries, and I can see in Ambari that Spark applications are being created, confirming Hive is using Spark as the engine. However, other than the Hive CLI, I am

Re: SQL UDF problem (with re to types)

2016-01-14 Thread Michael Armbrust
We automatically convert types for UDFs defined in Scala, but we can't do it in Java because the types are erased by the compiler. If you want to use double you should cast before calling the UDF. On Wed, Jan 13, 2016 at 8:10 PM, Raghu Ganti wrote: > So, when I try
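
For example, a sketch of casting from the DataFrame/SQL side before the Java UDF sees the value (the UDF name "myUdf" and the column "value" are illustrative, assuming the UDF has been registered):

    import org.apache.spark.sql.functions.{callUDF, col}

    // cast the column to double explicitly, then call the registered UDF
    val result = df.select(callUDF("myUdf", col("value").cast("double")))

    // or equivalently in SQL
    sqlContext.sql("SELECT myUdf(CAST(value AS DOUBLE)) FROM my_table")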

Re: strange behavior in spark yarn-client mode

2016-01-14 Thread Marcelo Vanzin
On Thu, Jan 14, 2016 at 10:17 AM, Sanjeev Verma wrote: > now it spawns a single executor with 1060M size, I am not able to understand > why this time it launches executors with 1G + overhead, not the 2G I > specified. Where are you looking for the memory size for the

Re: SQL UDF problem (with re to types)

2016-01-14 Thread Raghu Ganti
Would this go away if the Spark source were compiled against Java 1.8 (since the problem of type erasure is solved through proper generics implementation in Java 1.8)? On Thu, Jan 14, 2016 at 1:42 PM, Michael Armbrust wrote: > We automatically convert types for UDFs

Re: strange behavior in spark yarn-client mode

2016-01-14 Thread Marcelo Vanzin
Please reply to the list. The web ui does not show the total size of the executor's heap. It shows the amount of memory available for caching data, which is, give or take, 60% of the heap by default. On Thu, Jan 14, 2016 at 11:03 AM, Sanjeev Verma wrote: > I am

strange behavior in spark yarn-client mode

2016-01-14 Thread Sanjeev Verma
I am seeing a strange behaviour while running spark in yarn-client mode. I am observing this on a single-node yarn cluster. In spark-defaults I have configured the executor memory as 2g and started the spark shell as follows: bin/spark-shell --master yarn-client which triggers the 2 executors on

DataFrameWriter on partitionBy for parquet eat all RAM

2016-01-14 Thread Arkadiusz Bicz
Hi, What is the proper configuration for saving Parquet partitions with a large number of repeated keys? In the code below I load 500 million rows of data and partition them on a column with not so many different values. Using spark-shell with 30g per executor and driver and 3 executor cores

Re: code hangs in local master mode

2016-01-14 Thread Kai Wei
Thanks for your reply, Ted. Below is the stack dump for all threads: Thread dump for executor driver Updated at 2016/01/14 20:35:41 Collapse All Thread 89: Executor task launch worker-0 (TIMED_WAITING) sun.misc.Unsafe.park(Native Method)

RE: RE: RE: spark streaming context trigger invoke stop why?

2016-01-14 Thread Triones,Deng(vip.com)
Thanks for your response. Our code is as below: public void process(){ logger.info("streaming process start !!!"); SparkConf sparkConf = createSparkConf(this.getClass().getSimpleName()); JavaStreamingContext jsc = this.createJavaStreamingContext(sparkConf);

livy test problem: Failed to execute goal org.scalatest:scalatest-maven-plugin:1.0:test (test) on project livy-spark_2.10: There are test failures

2016-01-14 Thread Ruslan Dautkhanov
The Livy build test from master fails with the problem below. I can't track it down. YARN shows the Livy Spark yarn application as running, although an attempt to connect to the application master shows connection refused: HTTP ERROR 500 > Problem accessing /proxy/application_1448640910222_0046/. Reason: >

DataFrame partitionBy to a single Parquet file (per partition)

2016-01-14 Thread Patrick McGloin
Hi, I would like to repartition / coalesce my data so that it is saved into one Parquet file per partition. I would also like to use the Spark SQL partitionBy API. So I could do that like this: df.coalesce(1).write.partitionBy("entity", "year", "month", "day",
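
One way to get a single file per output partition without coalescing everything into one task (a sketch, assuming Spark 1.6's repartition-by-columns; only the partition columns visible in the snippet are shown, and the output path is illustrative):

    import org.apache.spark.sql.functions.col

    // Shuffle so that all rows of one (entity, year, month, day) combination land
    // in the same task; each output directory is then written by exactly one task,
    // producing one Parquet file per partition directory.
    df.repartition(col("entity"), col("year"), col("month"), col("day"))
      .write
      .partitionBy("entity", "year", "month", "day")
      .parquet("/output/path")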

Re: NPE when using Joda DateTime

2016-01-14 Thread Durgesh Verma
Today is my day... Trying to go through where I can pitch in. Let me know if the below makes sense. I looked at the Joda Java API source code (1.2.9) and traced that call in the NPE. It looks like the AssembledChronology class is being used; the iYears instance variable is defined as transient.

Can we use localIterator when we need to process data in one partition?

2016-01-14 Thread unk1102
Hi, I have a special requirement where I need to process data in one partition at the end, after doing a lot of filtering, updating etc. in a DataFrame. Currently, to process data in one partition I am using coalesce(1), which is painfully slow; my jobs hang for hours, even 5-6 hours, and I don't
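
One alternative to coalesce(1), per the subject line (a sketch, not a drop-in replacement for every use case): pull partitions to the driver one at a time with toLocalIterator, which avoids pushing the whole dataset through a single task:

    // Iterates over the DataFrame's rows on the driver, fetching one partition
    // at a time; only suitable if the final processing really must be sequential.
    df.rdd.toLocalIterator.foreach { row =>
      // process each Row here
      println(row)
    }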

Re: 1.6.0: Standalone application: Getting ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory

2016-01-14 Thread Egor Pahomov
My fault, I should have read the documentation more carefully - http://spark.apache.org/docs/latest/sql-programming-guide.html says precisely that I need to add these 3 jars to the classpath in case I need them. We cannot include them in the fat jar, because they are OSGi bundles and require a plugin.xml and

Set Hadoop User in Spark Shell

2016-01-14 Thread Daniel Valdivia
Hi, I'm trying to set the value of a Hadoop parameter within spark-shell, and System.setProperty("HADOOP_USER_NAME", "hadoop") seems to not be doing the trick. Does anyone know how I can set the hadoop.job.ugi parameter from within spark-shell? Cheers
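
Two things worth trying (a sketch; whether either takes effect depends on the Hadoop version and whether security is enabled). HADOOP_USER_NAME is read from the process environment before the JVM starts, not from a system property set inside the shell, and Hadoop properties can be placed on the SparkContext's Hadoop configuration:

    // launch the shell with the env var already set (simple-auth clusters only):
    //   HADOOP_USER_NAME=hadoop ./bin/spark-shell

    // inside spark-shell, Hadoop properties can be set on the context's configuration;
    // note that hadoop.job.ugi is deprecated in Hadoop 2 and may be ignored
    sc.hadoopConfiguration.set("hadoop.job.ugi", "hadoop")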

Re: SQL UDF problem (with re to types)

2016-01-14 Thread Michael Armbrust
I don't believe that Java 8 got rid of erasure. In fact I think it's actually worse when you use Java 8 lambdas. On Thu, Jan 14, 2016 at 10:54 AM, Raghu Ganti wrote: > Would this go away if the Spark source was compiled against Java 1.8 > (since the problem of type erasure

Re: 101 question on external metastore

2016-01-14 Thread Yana Kadiyska
If you have a second, could you post the version of derby that you installed, the contents of hive-site.xml, and the command you use to run (along with the Spark version)? I'd like to retry the installation. On Thu, Jan 7, 2016 at 7:35 AM, Deenar Toraskar wrote: > I

Re: Usage of SparkContext within a Web container

2016-01-14 Thread Eugene Morozov
Praveen, Zeppelin uses Spark's REPL. I'm currently writing an app that is a web service, which is going to run spark jobs. So, at the init stage I just create a JavaSparkContext and then use it for all user requests. The web service is stateless. The issue with stateless is that it's possible to run

Re: code hangs in local master mode

2016-01-14 Thread Ted Yu
Can you capture one or two stack traces of the local master process and pastebin them? Thanks On Thu, Jan 14, 2016 at 6:01 AM, Kai Wei wrote: > Hi list, > > I ran into an issue which I think could be a bug. > > I have a Hive table stored as parquet files. Let's say it's

Re: NPE when using Joda DateTime

2016-01-14 Thread Todd Nist
I had a similar problem a while back and leveraged these Kryo serializers: https://github.com/magro/kryo-serializers. I had to fall back to version 0.28, but that was a while back. You can add these to the org.apache.spark.serializer.KryoRegistrator and then set your registrator in the spark
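
A sketch of that setup, assuming the kryo-serializers jar from the repository above is on the driver and executor classpaths (class and config names below are illustrative):

    import com.esotericsoftware.kryo.Kryo
    import de.javakaffee.kryoserializers.jodatime.JodaDateTimeSerializer
    import org.apache.spark.serializer.KryoRegistrator

    // register a dedicated serializer for Joda DateTime instead of Kryo's default
    class JodaRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[org.joda.time.DateTime], new JodaDateTimeSerializer())
      }
    }

    // then, in spark-defaults.conf or via --conf:
    //   spark.serializer=org.apache.spark.serializer.KryoSerializer
    //   spark.kryo.registrator=JodaRegistrator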

Spark 1.5.2 streaming driver in YARN cluster mode on Hadoop 2.6 (on EMR 4.2) restarts after stop

2016-01-14 Thread Roberto Coluccio
Hi there, I'm facing a weird issue when upgrading from Spark 1.4.1 streaming driver on EMR 3.9 (hence Hadoop 2.4.0) to Spark 1.5.2 on EMR 4.2 (hence Hadoop 2.6.0). Basically, the very same driver which used to terminate after a timeout as expected, now does not. In particular, as long as the

Re: NPE when using Joda DateTime

2016-01-14 Thread Shixiong(Ryan) Zhu
Could you try to use "Kryo.setDefaultSerializer" like this: class YourKryoRegistrator extends KryoRegistrator { override def registerClasses(kryo: Kryo) { kryo.setDefaultSerializer(classOf[com.esotericsoftware.kryo.serializers.JavaSerializer]) } } On Thu, Jan 14, 2016 at 12:54 PM, Durgesh

Re: RE: RE: spark streaming context trigger invoke stop why?

2016-01-14 Thread Shixiong(Ryan) Zhu
Could you show your code? Did you use `StreamingContext.awaitTermination`? If so, it will return if any exception happens. On Wed, Jan 13, 2016 at 11:47 PM, Triones,Deng(vip.com) < triones.d...@vipshop.com> wrote: > What's more, I am running a 7*24 hours job, so I won't call System.exit() > by