Re: How to write mapreduce programming in spark by using java on user-defined javaPairRDD?

2015-07-07 Thread Feynman Liang
Hi MIssie, In the Java API, you should consider: 1. RDD.map https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#map(scala.Function1,%20scala.reflect.ClassTag) to transform the text 2. RDD.sortBy
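A minimal sketch of those two calls through the Java-friendly JavaRDD wrapper (Spark 1.x style). The input path, the upper-casing transform and the length-based sort key are illustrative assumptions, not details from the thread:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;

    public class MapSortSketch {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("MapSortSketch").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input.txt");   // illustrative path

        // map: transform each line of text
        JavaRDD<String> transformed = lines.map(new Function<String, String>() {
          @Override
          public String call(String line) {
            return line.trim().toUpperCase();
          }
        });

        // sortBy: order the records by a key derived from each record (here, its length)
        JavaRDD<String> sorted = transformed.sortBy(new Function<String, Integer>() {
          @Override
          public Integer call(String line) {
            return line.length();
          }
        }, true, 4);

        sorted.saveAsTextFile("hdfs:///tmp/sorted-output");             // illustrative path
        sc.stop();
      }
    }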

Regarding master node failure

2015-07-07 Thread swetha
Hi, What happens if the master node fails in the case of Spark Streaming? Would the data be lost? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Regarding-master-node-failure-tp23701.html Sent from the Apache Spark User List mailing list

Re: java.lang.OutOfMemoryError: PermGen space

2015-07-07 Thread jitender
Stati, Change SPARK_REPL_OPTS to SPARK_SUBMIT_OPTS and try again. I faced the same issue and making this change worked for me. I looked at the spark-shell file under the bin dir and found SPARK_SUBMIT_OPTS being used. SPARK_SUBMIT_OPTS=-XX:MaxPermSize=256m bin/spark-shell --master

Re: Windows - endless Dependency-reduced POM written... in Bagel build

2015-07-07 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTtxVUrL1AvnPj2 On Tue, Jul 7, 2015 at 10:04 AM, Lincoln Atkinson lat...@microsoft.com wrote: I’m trying to build Spark from source on Windows 8.1, using a recent Cygwin install and JDK 8u45. From the root of my enlistment, I’m running

DataFrame question

2015-07-07 Thread Naveen Madhire
Hi All, I am working with dataframes and have been struggling with this thing, any pointers would be helpful. I've a Json file with the schema like this:
 links: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- desc: string (nullable = true)
 |    |    |--

Re: Windows - endless Dependency-reduced POM written... in Bagel build

2015-07-07 Thread Ted Yu
Looks like a workaround has gone in: [SPARK-8819] Fix build for maven 3.3.x FYI On Tue, Jul 7, 2015 at 10:09 AM, Ted Yu yuzhih...@gmail.com wrote: See this thread: http://search-hadoop.com/m/q3RTtxVUrL1AvnPj2 On Tue, Jul 7, 2015 at 10:04 AM, Lincoln Atkinson lat...@microsoft.com wrote:

Re: Hibench build fail

2015-07-07 Thread Ted Yu
bq. Need I specify my spark version Looks like the build used 1.4.0 SNAPSHOT. Please use 1.4.0 release. Cheers On Mon, Jul 6, 2015 at 11:50 PM, luohui20...@sina.com wrote: Hi grace, recently I am trying Hibench to evaluate my spark cluster, however I got a problem in building Hibench,

Windows - endless Dependency-reduced POM written... in Bagel build

2015-07-07 Thread Lincoln Atkinson
I'm trying to build Spark from source on Windows 8.1, using a recent Cygwin install and JDK 8u45. From the root of my enlistment, I'm running `build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package` The build moves along just fine for a while, until it builds Spark

Re: Master doesn't start, no logs

2015-07-07 Thread Max Demoulin
Thanks, I tried that, and the result was the same. I can still start a master from the spark-1.4.0-bin-hadoop2.4 pre-built version, though I don't really know what to show more than the strace that I already linked, so I could use any hint for that. -- Henri Maxime Demoulin 2015-07-07 9:53

RE: Windows - endless Dependency-reduced POM written... in Bagel build

2015-07-07 Thread Lincoln Atkinson
That solved it. Thanks! From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Tuesday, July 07, 2015 10:21 AM To: Lincoln Atkinson Cc: user@spark.apache.org Subject: Re: Windows - endless Dependency-reduced POM written... in Bagel build Looks like a workaround has gone in: [SPARK-8819] Fix build

Re: How do we control output part files created by Spark job?

2015-07-07 Thread Umesh Kacha
Hi, I tried both approaches, using df.repartition(6) and df.coalesce(6); neither reduces the part-x files. Even after calling the above methods I still see around 200 small part files of about 20 MB each, which are again ORC files. On Tue, Jul 7, 2015 at 12:52 AM, Sathish Kumaran Vairavelu

Is it now possible to incrementally update a graph in GraphX

2015-07-07 Thread Hellen
I found this post back in March 2014. http://apache-spark-user-list.1001560.n3.nabble.com/Incrementally-add-remove-vertices-in-GraphX-td2227.html I was wondering if there is any progress on GraphX Streaming/incremental graph update in GraphX. Or is there a place where I can track the progress on

How to deal with null values on LabeledPoint

2015-07-07 Thread Saif.A.Ellafi
Hello, reading from spark-csv, got some lines with missing data (not invalid). Applying map() to create a LabeledPoint with denseVector. Using map( Row => Row.getDouble(col_index) ) To this point: res173: org.apache.spark.mllib.regression.LabeledPoint =

Re: Regarding master node failure

2015-07-07 Thread Tathagata Das
This talk may help - https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/ On Tue, Jul 7, 2015 at 9:51 AM, swetha swethakasire...@gmail.com wrote: Hi, What happens if the master node fails in the case of Spark Streaming? Would the data be lost?

Hibench build fail

2015-07-07 Thread luohui20001
Hi grace, recently I am trying HiBench to evaluate my spark cluster, however I got a problem in building HiBench, would you help to take a look? Thanks. It fails at building Sparkbench, and you may check the attached pic for more info. My spark version: 1.3.1, hadoop version: 2.7.0

Re: Master doesn't start, no logs

2015-07-07 Thread Akhil Das
Strange. What are you having in $SPARK_MASTER_IP? It may happen that it is not able to bind to the given ip but again it should be in the logs. Thanks Best Regards On Tue, Jul 7, 2015 at 12:54 AM, maxdml maxdemou...@gmail.com wrote: Hi, I've been compiling spark 1.4.0 with SBT, from the

Re: How to debug java.io.OptionalDataException issues

2015-07-07 Thread Akhil Das
Did you try kryo? Wrap everything with kryo and see if you are still hitting the exception. (At least you could see a different exception stack). Thanks Best Regards On Tue, Jul 7, 2015 at 6:05 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, suffering from a pretty strange issue:

SparkSQL OOM issue

2015-07-07 Thread shshann
Dear all, We've tried to use SparkSQL to do some insert from A table to B table action where, using the exact same SQL script, Hive is able to finish it but Spark 1.3.1 would always end with an OOM issue; we tried several configurations including: --executor-cores 2 --num-executors 300

Re: SparkSQL OOM issue

2015-07-07 Thread Xiaoyu Ma
Hi, Where did the OOM happen? In the driver or an executor? Sometimes the SparkSQL driver OOMs on tables with a large number of partitions. If so, you might want to increase spark.driver.memory in spark-defaults.conf. Shawn On Jul 7, 2015, at 3:58 PM, shsh...@tsmc.com wrote: Dear all, We've tried to

Re: Can we allow executor to exit when tasks fail too many time?

2015-07-07 Thread Tao Li
Any response? 2015-07-06 12:28 GMT+08:00 Tao Li litao.bupt...@gmail.com: Nodes cloud10141049104.wd.nm.nop.sogou-op.org and cloud101417770.wd.nm.ss.nop.sogou-op.org failed too many times; I want to know if they can be taken offline automatically when they fail too many times. 2015-07-06 12:25 GMT+08:00 Tao

HiveContext throws org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

2015-07-07 Thread bdev
Just trying to get started with Spark and attempting to use HiveContext using spark-shell to interact with existing Hive tables on my CDH cluster but keep running into the errors (pls see below) when I do 'hiveContext.sql("show tables")'. Wanted to know what all JARs need to be included to have this

Maintain Persistent Connection with Hive meta store

2015-07-07 Thread wazza
Hi, I am new to Apache Spark and I have tried to query hive tables using Apache Spark SQL. First I tried it in spark-shell, where I can query 1 lakh (100,000) records from a hive table within a second. Then I tried it in Java code, which always takes more than 10 seconds, and I have noted that each time

Re: sparkr-submit additional R files

2015-07-07 Thread Shivaram Venkataraman
You can just use `--files` and I think it should work. Let us know on https://issues.apache.org/jira/browse/SPARK-6833 if it doesn't work as expected. Thanks Shivaram On Tue, Jul 7, 2015 at 5:13 AM, Michał Zieliński zielinski.mich...@gmail.com wrote: Hi all, *spark-submit* for Python and

How to change hive database?

2015-07-07 Thread Arun Luthra
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.hive.HiveContext I'm getting org.apache.spark.sql.catalyst.analysis.NoSuchTableException from: val dataframe = hiveContext.table("other_db.mytable") Do I have to change current database to access it? Is it possible to

Re: How to change hive database?

2015-07-07 Thread Ted Yu
See this thread http://search-hadoop.com/m/q3RTt0NFls1XATV02 Cheers On Tue, Jul 7, 2015 at 11:07 AM, Arun Luthra arun.lut...@gmail.com wrote: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.hive.HiveContext I'm getting
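The linked thread is not reproduced here; as a hedged sketch, one common way to reach a table in another database with HiveContext is to switch the current database first. Database and table names below are illustrative, and this is not confirmed to be what the thread recommends:

    import org.apache.spark.SparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.hive.HiveContext;

    public class OtherDatabaseSketch {
      public static void main(String[] args) {
        SparkContext sc = new SparkContext("local[2]", "OtherDatabaseSketch");
        HiveContext hiveContext = new HiveContext(sc);

        // Switch the session's current database, then resolve the unqualified table name.
        hiveContext.sql("USE other_db");
        DataFrame df = hiveContext.table("mytable");
        df.show();

        sc.stop();
      }
    }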

(de)serialize DStream

2015-07-07 Thread Chen Song
In Spark Streaming, when using updateStateByKey, it requires the generated DStream to be checkpointed. It seems that it always uses JavaSerializer, no matter what I set for spark.serializer. Can I use KryoSerializer for checkpointing? If not, I assume the key and value types have to be

Re: how to black list nodes on the cluster

2015-07-07 Thread Gylfi
Hi again, OK, now I do not know of any way to fix the problem other than deleting the bad machine from the config + restart .. and you will need admin privileges on the cluster for that :( However, before we give up on the speculative execution, I suspect that the task is being run again and again on

Re: DataFrame question

2015-07-07 Thread Michael Armbrust
You probably want to explode the array to produce one row per element: df.select(explode(df("links")).alias("link")) On Tue, Jul 7, 2015 at 10:29 AM, Naveen Madhire vmadh...@umail.iu.edu wrote: Hi All, I am working with dataframes and have been struggling with this thing, any pointers would be
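For the Java DataFrame API, a hedged equivalent of that Scala one-liner might look like the sketch below. The nested desc field comes from the question's schema; the JSON path and context setup are illustrative assumptions:

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;
    import static org.apache.spark.sql.functions.explode;

    public class ExplodeSketch {
      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[2]", "ExplodeSketch");
        SQLContext sqlContext = new SQLContext(sc);

        DataFrame df = sqlContext.jsonFile("links.json");   // illustrative path

        // One output row per element of the 'links' array
        DataFrame exploded = df.select(explode(df.col("links")).alias("link"));

        // Nested struct fields are then reachable through the alias
        exploded.select(exploded.col("link.desc")).show();

        sc.stop();
      }
    }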

What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-07 Thread Arun Ahuja
Is there more documentation on what is needed to set up BLAS/LAPACK native support with Spark? I’ve built spark with the -Pnetlib-lgpl flag and see that the netlib classes are in the assembly jar. jar tvf spark-assembly-1.5.0-SNAPSHOT-hadoop2.6.0.jar | grep netlib | grep Native 6625 Tue Jul 07

Re: How to implement top() and filter() on object List for JavaRDD

2015-07-07 Thread Ted Yu
bq. my class has already implemented the java.io.Serializable Can you show the code for the Model.User class? Cheers On Tue, Jul 7, 2015 at 8:18 AM, Hafsa Asif hafsa.a...@matchinguu.com wrote: Thank u so much for the solution. I run the code like this, JavaRDD<User> rdd =

Re: is it possible to disable -XX:OnOutOfMemoryError=kill %p for the executors?

2015-07-07 Thread Marcelo Vanzin
SIGTERM on YARN generally means the NM is killing your executor because it's running over its requested memory limits. Check your NM logs to make sure. And then take a look at the memoryOverhead setting for driver and executors (http://spark.apache.org/docs/latest/running-on-yarn.html). On Tue,

Re: How do we control output part files created by Spark job?

2015-07-07 Thread Gylfi
Hi. I am just wondering if the rdd was actually modified. Did you test it by printing rdd.partitions.length before and after? Regards, Gylfi. -- View this message in context:

Re: unable to bring up cluster with ec2 script

2015-07-07 Thread Arun Ahuja
Sorry, I can't help with this issue, but if you are interested in a simple way to launch a Spark cluster on Amazon, Spark is now offered as an application in Amazon EMR. With this you can have a full cluster with a few clicks: https://aws.amazon.com/blogs/aws/new-apache-spark-on-amazon-emr/ -

Re: Spark Kafka Direct Streaming

2015-07-07 Thread Tathagata Das
When you enable checkpointing by setting the checkpoint directory, you enable metadata checkpointing. Data checkpointing kicks in only if you are using a DStream operation that requires it, or you are enabling Write Ahead Logs to prevent data loss on driver failure. More discussion -
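A hedged sketch of that distinction in the Java streaming API: setting the checkpoint directory covers metadata checkpointing, and a stateful operation such as updateStateByKey is what additionally brings in data checkpointing. The socket source, batch interval and paths are illustrative, not from the thread:

    import com.google.common.base.Optional;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.api.java.function.PairFunction;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;
    import java.util.List;

    public class CheckpointSketch {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CheckpointSketch").setMaster("local[2]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Setting a checkpoint directory turns on metadata checkpointing.
        ssc.checkpoint("hdfs:///tmp/checkpoints");           // illustrative path

        JavaDStream<String> words = ssc.socketTextStream("localhost", 9999);
        JavaPairDStream<String, Integer> ones = words.mapToPair(
            new PairFunction<String, String, Integer>() {
              @Override
              public Tuple2<String, Integer> call(String w) {
                return new Tuple2<String, Integer>(w, 1);
              }
            });

        // A stateful operation like updateStateByKey additionally requires data checkpointing.
        JavaPairDStream<String, Integer> totals = ones.updateStateByKey(
            new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {
              @Override
              public Optional<Integer> call(List<Integer> values, Optional<Integer> state) {
                int sum = state.or(0);
                for (Integer v : values) {
                  sum += v;
                }
                return Optional.of(sum);
              }
            });

        totals.print();
        ssc.start();
        ssc.awaitTermination();
      }
    }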

unable to bring up cluster with ec2 script

2015-07-07 Thread Pagliari, Roberto
I'm following the tutorial about Apache Spark on EC2. The output is the following: $ ./spark-ec2 -i ../spark.pem -k spark --copy launch spark-training Setting up security groups... Searching for existing cluster spark-training... Latest Spark AMI: ami-19474270 Launching

Does spark supports the Hive function posexplode function?

2015-07-07 Thread Jeff J Li
I am trying to use the posexplode function in the HiveContext to auto-generate a sequence number. This feature is supposed to be available in Hive 0.13.0. SELECT name, phone FROM contact LATERAL VIEW posexplode(phoneList.phoneNumber) phoneTable AS pos, phone. My test program failed with the

Re: How do we control output part files created by Spark job?

2015-07-07 Thread ponkin
Hi, Did you try to reduce number of executors and cores? usually num-executors * executor-cores = number of parallel tasks, so you can reduce number of parallel tasks in command line like ./bin/spark-submit --class org.apache.spark.examples.SparkPi \ --master yarn-cluster \ --num-executors

Best practice for using singletons on workers (seems unanswered) ?

2015-07-07 Thread dgoldenberg
Hi, I am seeing a lot of posts on singletons vs. broadcast variables, such as * http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-have-some-singleton-per-worker-tt20277.html *

spark-submit can not resolve spark-hive_2.10

2015-07-07 Thread Hao Ren
I want to add spark-hive as a dependency to submit my job, but it seems that spark-submit can not resolve it. $ ./bin/spark-submit \ → --packages org.apache.spark:spark-hive_2.10:1.4.0,org.postgresql:postgresql:9.3-1103-jdbc3,joda-time:joda-time:2.8.1 \ → --class

Re: Spark standalone cluster - Output file stored in temporary directory in worker

2015-07-07 Thread maxdml
I think the properties that you have in your hdfs-site.xml should go in the core-site.xml (at least for the namenode.name and datanode.data ones). I might be wrong here, but that's what I have in my setup. You should also add hadoop.tmp.dir in your core-site.xml. That might be the source of your

Re: How to implement top() and filter() on object List for JavaRDD

2015-07-07 Thread Hafsa Asif
Thank u so much for the solution. I run the code like this, JavaRDD<User> rdd = context.parallelize(usersList); JavaRDD<User> rdd_sorted_users = rdd.sortBy(new Function<User, String>(){ @Override public String call(User usr1) throws Exception {

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-07 Thread Richard Marscher
Would it be possible to have a wrapper class that just represents a reference to a singleton holding the 3rd party object? It could proxy over calls to the singleton object which will instantiate a private instance of the 3rd party object lazily? I think something like this might work if the
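A hedged sketch of that pattern; the third-party client class and its method are placeholders. The holder is what the closure captures, and the client itself is created lazily, once per executor JVM:

    import java.io.Serializable;

    // Placeholder for an expensive, non-serializable third-party object.
    class ThirdPartyClient {
      void send(String record) { /* talk to the external service */ }
    }

    // Serializable wrapper: only this lightweight reference is shipped with the closure.
    // The real client is built lazily on the worker, one instance per JVM.
    class ClientHolder implements Serializable {
      private static ThirdPartyClient instance;

      static synchronized ThirdPartyClient client() {
        if (instance == null) {
          instance = new ThirdPartyClient();
        }
        return instance;
      }
    }

Inside a transformation, calls then go through the holder, e.g. ClientHolder.client().send(record) within a foreach or mapPartitions, so the driver never needs to serialize the third-party object itself.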

Re: Error when connecting to Spark SQL via Hive JDBC driver

2015-07-07 Thread Eric Pederson
Hi Ratio - You need more than just hive-jdbc jar. Here are all of the jars that I found were needed. I got this list from https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-RunningtheJDBCSampleCode plus trial and error. [image: Inline image 1] -- Eric On

Re: How do we control output part files created by Spark job?

2015-07-07 Thread Srikanth
Did you do yourRdd.coalesce(6).saveAsTextFile() or yourRdd.coalesce(6); yourRdd.saveAsTextFile()? Srikanth On Tue, Jul 7, 2015 at 12:59 PM, Umesh Kacha umesh.ka...@gmail.com wrote: Hi I tried both approach using df. repartition(6) and
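The difference matters because coalesce returns a new RDD rather than modifying the one it is called on. A hedged Java illustration, with placeholder paths:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CoalesceSketch {
      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[2]", "CoalesceSketch");
        JavaRDD<String> rdd = sc.textFile("hdfs:///tmp/input");        // placeholder path

        // No effect on the output below: the coalesced RDD is discarded.
        rdd.coalesce(6);
        rdd.saveAsTextFile("hdfs:///tmp/out-unchanged");

        // Writes at most 6 part files: the coalesced RDD itself is saved.
        rdd.coalesce(6).saveAsTextFile("hdfs:///tmp/out-six-parts");

        sc.stop();
      }
    }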

spark - redshift !!!

2015-07-07 Thread spark user
Hi, Can you help me with how to load data from an S3 bucket to Redshift? If you have sample code, can you pls send it to me? Thanks su

Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-07 Thread Brandon White
Say I have a spark job that looks like following: def loadTable1() { val table1 = sqlContext.jsonFile("ss3://textfiledirectory/") table1.cache().registerTempTable("table1") } def loadTable2() { val table2 = sqlContext.jsonFile("ss3://testfiledirectory2/")

Why can I not insert into TempTables in Spark SQL?

2015-07-07 Thread Brandon White
Why does this not work? Is insert into broken in 1.3.1? val ssc = new StreamingContext(sc, Minutes(10)) val currentStream = ssc.textFileStream("ss3://textFileDirectory/") val dayBefore = sqlContext.jsonFile("ss3://textFileDirectory/") dayBefore.saveAsParquetFile("/tmp/cache/dayBefore.parquet") val

Hive UDFs

2015-07-07 Thread chrish2312
I know the typical way to apply a hive UDF to a dataframe is basically something like: dataframe.selectExpr("reverse(testString) as reversedString") Is there a way to apply the hive UDF just to a single row and get a row back? Something like: dataframe.first.selectExpr("reverse(testString) as

Re: spark-submit can not resolve spark-hive_2.10

2015-07-07 Thread Burak Yavuz
spark-hive is excluded when using --packages, because it can be included in the spark-assembly by adding -Phive during mvn package or sbt assembly. Best, Burak On Tue, Jul 7, 2015 at 8:06 AM, Hao Ren inv...@gmail.com wrote: I want to add spark-hive as a dependence to submit my job, but it

Re:

2015-07-07 Thread Gerard Maas
Anand, AFAIK, you will need to change two settings: spark.streaming.unpersist = false // in order for SStreaming to not drop the raw RDD data spark.cleaner.ttl = some reasonable value in seconds. Also be aware that the lineage of your union RDD will grow with each batch interval. You will need
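A hedged sketch of where those two settings go, on the SparkConf used to build the streaming context; the TTL value and batch interval are only examples:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class UnionStreamConf {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("UnionStream")
            .setMaster("local[2]")
            // keep Spark Streaming from automatically unpersisting the raw batch RDDs
            .set("spark.streaming.unpersist", "false")
            // example value: let the cleaner drop old metadata/RDDs after one hour
            .set("spark.cleaner.ttl", "3600");

        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(60));
        // ... build the DStream and the unioned RDD here ...
      }
    }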

sparkr-submit additional R files

2015-07-07 Thread Michał Zieliński
Hi all, *spark-submit* for Python and Java/Scala has *--py-files* and *--jars* options for submitting additional files on top of the main application. Is there any such option for *sparkr-submit*? I know that there is *includePackage() *R function to add library dependencies, but can you add

Re: User Defined Functions - Execution on Clusters

2015-07-07 Thread Eskilson,Aleksander
Interesting, thanks for the heads up. On 7/6/15, 7:19 PM, Davies Liu dav...@databricks.com wrote: Currently, Python UDFs run in a Python instances, are MUCH slower than Scala ones (from 10 to 100x). There is JIRA to improve the performance: https://issues.apache.org/jira/browse/SPARK-8632, After

Question about master memory requirement and GraphX pagerank performance !

2015-07-07 Thread Khaled Ammar
Hi all, I am fairly new to spark and wonder if you can help me. I am exploring GraphX/Spark by running the pagerank example on a medium size graph (12 GB) using this command: My cluster is 1+16 machines, the master has 15 GB memory and each worker has 30 GB. The master has 2 cores and each

RE:

2015-07-07 Thread Evo Eftimov
spark.streaming.unpersist = false // in order for SStreaming to not drop the raw RDD data spark.cleaner.ttl = some reasonable value in seconds. Why is the above suggested, provided the persist/cache operation on the constantly unionized Batch RDD will have to be invoked anyway (after every

Re: How to solve ThreadException in Apache Spark standalone Java Application

2015-07-07 Thread Akhil Das
Can you try adding sc.stop at the end of your program? Looks like it's having a hard time closing off the SparkContext. Thanks Best Regards On Tue, Jul 7, 2015 at 4:08 PM, Hafsa Asif hafsa.a...@matchinguu.com wrote: Hi, I run the following simple Java spark standalone app with maven command

Re: How to implement top() and filter() on object List for JavaRDD

2015-07-07 Thread Akhil Das
Here's a simplified example: SparkConf conf = new SparkConf().setAppName("Sigmoid").setMaster("local"); JavaSparkContext sc = new JavaSparkContext(conf); List<String> user = new ArrayList<String>(); user.add("Jack"); user.add("Jill");

Re: How to implement top() and filter() on object List for JavaRDD

2015-07-07 Thread Hafsa Asif
I have also tried this stupid code snippet, only thinking that it might at least compile: Function1<User, Object> FILTER_USER = new AbstractFunction1<User, Object>() { public Object apply(User user){ return user; } }; FILTER_USER is fine but cannot be applied to the

Please add the Cincinnati spark meetup to the list of meet ups

2015-07-07 Thread Darin McBeath
 http://www.meetup.com/Cincinnati-Apache-Spark-Meetup/ Thanks. Darin.

Re:

2015-07-07 Thread Gerard Maas
Evo, I'd let the OP clarify the question. I'm not in a position to clarify his requirements beyond what's written in the question. Regarding window vs mutable union: window is a well-supported feature that accumulates messages over time. The mutable unioning of RDDs is bound to operational

Re: How to implement top() and filter() on object List for JavaRDD

2015-07-07 Thread ayan guha
I would suggest you take a look at DataFrames. Also, I do not think you should implement comparators for the user class as a whole; rather you should get the attribute to sort/compare on and delegate sorting to the data type of the inherent attribute. E.g. sorting can be done by name and, if so, it should be string

Re: Master doesn't start, no logs

2015-07-07 Thread Max Demoulin
Yes, I do set $SPARK_MASTER_IP. I suspect a more internal issue, maybe due to multiple spark/hdfs instances having successively run on the same machine? -- Henri Maxime Demoulin 2015-07-07 4:10 GMT-04:00 Akhil Das ak...@sigmoidanalytics.com: Strange. What are you having in $SPARK_MASTER_IP? It

RECEIVED SIGNAL 15: SIGTERM

2015-07-07 Thread Kostas Kougios
I am still receiving these weird SIGTERMs on the executors. The driver claims it lost the executor; the executor receives a SIGTERM (from whom???). It doesn't seem to be a memory-related issue, though increasing memory takes the job a bit further or completes it. But why? There is no memory pressure on

How to implement top() and filter() on object List for JavaRDD

2015-07-07 Thread Hafsa Asif
Hi, I have an object list of Users and I want to implement top() and filter() methods on the object list. Let me explain the whole scenario: 1. I have a User object list named usersList. I fill it during record set. User user = new User();

[SPARK-SQL] libgplcompression.so already loaded in another classloader

2015-07-07 Thread Sea
Hi, all I found an Exception when using spark-sql java.lang.UnsatisfiedLinkError: Native Library /data/lib/native/libgplcompression.so already loaded in another classloader ... I set spark.sql.hive.metastore.jars=. in file spark-defaults.conf It does not happen every time. Who knows

How to solve ThreadException in Apache Spark standalone Java Application

2015-07-07 Thread Hafsa Asif
Hi, I run the following simple Java spark standalone app with maven command exec:java -Dexec.mainClass=SimpleApp public class SimpleApp { public static void main(String[] args) { System.out.println("Reading and Connecting with Spark."); try { String logFile =

Re: Spark Standalone Cluster - Slave not connecting to Master

2015-07-07 Thread Himanshu Mehra
Hi MorEru, the same problem occurred to me. I had to change the version of the maven dependency from spark_core_2.11 to spark_core_2.10 and it worked. Thanks Himanshu -- View this message in context:

[no subject]

2015-07-07 Thread Anand Nalya
Hi, Suppose I have an RDD that is loaded from some file and then I also have a DStream that has data coming from some stream. I want to keep union some of the tuples from the DStream into my RDD. For this I can use something like this: var myRDD: RDD[(String, Long)] = sc.fromText...

Re: How to implement top() and filter() on object List for JavaRDD

2015-07-07 Thread Hafsa Asif
Thank u for your quick response. But, I tried this and get the error as shown in pic error.jpg http://apache-spark-user-list.1001560.n3.nabble.com/file/n23676/error.jpg -- View this message in context:

Re: The auxService:spark_shuffle does not exist

2015-07-07 Thread Jeff Zhang
Did you enable the dynamic resource allocation ? You can refer to this page for how to configure spark shuffle service for yarn. https://spark.apache.org/docs/1.4.0/job-scheduling.html On Tue, Jul 7, 2015 at 10:55 PM, roy rp...@njit.edu wrote: we tried --master yarn-client with no different

RE: Hive UDFs

2015-07-07 Thread Cheng, Hao
dataframe.limit(1).selectExpr(xxx).collect()? -Original Message- From: chrish2312 [mailto:c...@palantir.com] Sent: Wednesday, July 8, 2015 6:20 AM To: user@spark.apache.org Subject: Hive UDFs I know the typical way to apply a hive UDF to a dataframe is basically something like:
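Spelled out in the Java DataFrame API, that suggestion might look like the following hedged fragment. It assumes the question's 'dataframe' and column name; the null-guard is only illustrative:

    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Row;

    // Restrict to one row, apply the Hive UDF expression, and collect the result locally.
    Row[] rows = dataframe.limit(1)
        .selectExpr("reverse(testString) as reversedString")
        .collect();
    String reversed = rows.length > 0 ? rows[0].getString(0) : null;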

Re: SparkSQL OOM issue

2015-07-07 Thread shshann
Hi Shawn, Thanks a lot, that's actually the last parameter we overlooked!! I'm able to run the same SQL on Spark now if I set spark.driver.memory larger, thanks again!! -- Best Regards, Felicia Shann 單師涵 +886-3-5636688 Ext. 7124300

How to submit streaming application and exit

2015-07-07 Thread Bin Wang
I'm writing a streaming application and want to use spark-submit to submit it to a YARN cluster. I'd like to submit it in a client node and exit spark-submit after the application is running. Is it possible?

RE: Hibench build fail

2015-07-07 Thread Huang, Jie
Hi Hui, Could you please add more descriptions (about the failure) in HiBench github Issues? HiBench works with spark 1.2 and above. Thank you Best Regards, Grace (Huang Jie) From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Wednesday, July 8, 2015 12:50 AM To: 罗辉 Cc: user; Huang, Jie Subject:

Re: spark - redshift !!!

2015-07-07 Thread Peter Zybrick
Hi, I have done a lot of EMR-S3-Redshift using Redshift COPY, haven't done any from Spark yet but I plan on doing it soon and have been doing some research. Take a look at this article - Best Practices for Micro-Batch Loading on Amazon Redshift

[no subject]

2015-07-07 Thread 付雅丹
Hi, everyone! I've got key/value pairs in the form of <LongWritable, Text>, where I used the following code: SparkConf conf = new SparkConf().setAppName("MapReduceFileInput"); JavaSparkContext sc = new JavaSparkContext(conf); Configuration confHadoop = new Configuration(); JavaPairRDD<LongWritable,Text>

How to write mapreduce programming in spark by using java on user-defined javaPairRDD?

2015-07-07 Thread 付雅丹
Hi, everyone! I've got key/value pairs in the form of <LongWritable, Text>, where I used the following code: SparkConf conf = new SparkConf().setAppName("MapReduceFileInput"); JavaSparkContext sc = new JavaSparkContext(conf); Configuration confHadoop = new Configuration(); JavaPairRDD<LongWritable,Text>

Re:

2015-07-07 Thread Ted Yu
Please take a look at core/src/test/java/org/apache/spark/JavaAPISuite.java in source code. Cheers On Tue, Jul 7, 2015 at 7:17 AM, 付雅丹 yadanfu1...@gmail.com wrote: Hi, everyone! I've got key,value pair in form of LongWritable, Text, where I used the following code: SparkConf conf = new
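Alongside that pointer, a hedged minimal sketch of a map/reduce pass over the question's JavaPairRDD<LongWritable, Text>; the word-count-style logic and output path are illustrative, not from the thread:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    public class PairRddSketch {
      // 'source' is the JavaPairRDD<LongWritable, Text> built with hadoopFile in the question.
      static void process(JavaPairRDD<LongWritable, Text> source) {
        // Map phase: convert the Writables to plain Java types and emit (value, 1) pairs.
        JavaPairRDD<String, Integer> ones = source.mapToPair(
            new PairFunction<Tuple2<LongWritable, Text>, String, Integer>() {
              @Override
              public Tuple2<String, Integer> call(Tuple2<LongWritable, Text> kv) {
                return new Tuple2<String, Integer>(kv._2().toString(), 1);
              }
            });

        // Reduce phase: sum the counts for each key.
        JavaPairRDD<String, Integer> counts = ones.reduceByKey(
            new Function2<Integer, Integer, Integer>() {
              @Override
              public Integer call(Integer a, Integer b) {
                return a + b;
              }
            });

        counts.saveAsTextFile("hdfs:///tmp/counts");   // illustrative output path
      }
    }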

Re: Job consistently failing after leftOuterJoin() - oddly sized / non-uniform partitions

2015-07-07 Thread beancinematics
Right, I figured I'd need a custom partitioner from what I've read around! Documentation on this is super sparse; do you have any recommended links on solving data skew and/or creating custom partitioners in Spark 1.4? I'd also love to hear if this is an unusual problem with my type of set-up -

Re: How to solve ThreadException in Apache Spark standalone Java Application

2015-07-07 Thread Hafsa Asif
I also tried sc.stop(). Sorry I did not include that in my question, but I am still getting the thread exception. I should also mention that I am working on a VM. 15/07/07 06:00:32 ERROR ActorSystemImpl: Uncaught error from thread [sparkDriver-akka.actor.default-dispatcher-5]

RE:

2015-07-07 Thread Evo Eftimov
Requirements – then see my abstracted interpretation – what else do you need in terms of Requirements …: “Suppose I have an RDD that is loaded from some file and then I also have a DStream that has data coming from some stream. I want to keep union some of the tuples from the DStream into

Re: Spark standalone cluster - Output file stored in temporary directory in worker

2015-07-07 Thread MorEru
core-site.xml: <configuration> <property> <name>fs.default.name</name> <value>hdfs://localhost:9000</value> </property> </configuration> hdfs_site.xml - <configuration> <property> <name>dfs.replication</name> <value>1</value> </property> <property> <name>dfs.namenode.name.dir</name>

Re: is it possible to disable -XX:OnOutOfMemoryError=kill %p for the executors?

2015-07-07 Thread Kostas Kougios
it seems it is hardcoded in ExecutorRunnable.scala : val commands = prefixEnv ++ Seq( YarnSparkHadoopUtil.expandEnvironment(Environment.JAVA_HOME) + "/bin/java", "-server", // Kill if OOM is raised - leverage yarn's failure handling to cause rescheduling. // Not killing the

Re: Master doesn't start, no logs

2015-07-07 Thread Akhil Das
Can you try renaming the ~/.ivy2 file to ~/.ivy2_backup and build spark1.4.0 again and run it? Thanks Best Regards On Tue, Jul 7, 2015 at 6:27 PM, Max Demoulin maxdemou...@gmail.com wrote: Yes, I do set $SPARK_MASTER_IP. I suspect a more internal issue, maybe due to multiple spark/hdfs

is it possible to disable -XX:OnOutOfMemoryError=kill %p for the executors?

2015-07-07 Thread Kostas Kougios
I get a suspicious SIGTERM on the executors that doesn't seem to be from the driver. The other thing that might send a SIGTERM is the -XX:OnOutOfMemoryError=kill %p java arg that the executor starts with. Now my tasks don't seem to run out of memory, so how can I disable this param to debug them? --

Re: Spark Standalone Cluster - Slave not connecting to Master

2015-07-07 Thread MorEru
Hi Himanshu, I am using spark_core_2.10 in my maven dependency. There were no issues with that. The problem I had with this was that the spark master was running on localhost inside the VM and the slave was not able to connect to it. I changed the spark master to run on the private IP address

How to verify that the worker is connected to master in CDH5.4

2015-07-07 Thread Ashish Dutt
Hi, I have CDH 5.4 installed on a linux server. It has 1 cluster in which spark is deployed as a history server. I am trying to connect my laptop to the spark history server. When I run spark-shell master ip: port number I get the following output How can I verify that the worker is connected to

Re: How to submit streaming application and exit

2015-07-07 Thread ayan guha
spark-submit is nothing but a process in your OS, so you should be able to submit it in background and exit. However, your spark-submit process itself is the driver for your spark streaming application, so it will not exit for the lifetime of the streaming app. On Wed, Jul 8, 2015 at 1:13 PM, Bin

Re: How to verify that the worker is connected to master in CDH5.4

2015-07-07 Thread Guru Medasani
Hi Ashish, Are you running Spark-on-YARN on the cluster with an instance of Spark History server? Also if you are using Cloudera Manager and using Spark on YARN, spark on yarn service has a link for the history server web UI. Can you paste the command and the output you are seeing in the

Re: How to verify that the worker is connected to master in CDH5.4

2015-07-07 Thread Ashish Dutt
Thank you Ayan for your response. But I have just realised that Spark is configured to be a history server. Please, can somebody suggest how I can convert the Spark history server to be a Master server? Thank you Sincerely, Ashish Dutt On Wed, Jul 8, 2015 at 12:28 PM, ayan guha

Re: How to verify that the worker is connected to master in CDH5.4

2015-07-07 Thread Ashish Dutt
Hello Guru, Thank you for your quick response. This is what I get when I try executing spark-shell master ip:port number C:\spark-1.4.0\bin> spark-shell master IP:18088 log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory). log4j:WARN Please

Re: How to create empty RDD

2015-07-07 Thread ๏̯͡๏
It worked Zhou. On Mon, Jul 6, 2015 at 10:43 PM, Wei Zhou zhweisop...@gmail.com wrote: I userd val output: RDD[(DetailInputRecord, VISummary)] = sc.emptyRDD[(DetailInputRecord, VISummary)] to create empty RDD before. Give it a try, it might work for you too. 2015-07-06 14:11 GMT-07:00

Catalyst Errors when building spark from trunk

2015-07-07 Thread Stephen Boesch
The following errors are occurring upon building using mvn options clean package Are there some requirements/restrictions on profiles/settings for catalyst to build properly? [error] /shared/sparkup2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala:138: value

Re: How to verify that the worker is connected to master in CDH5.4

2015-07-07 Thread Guru Medasani
Hi Ashish, If you are not using Spark on YARN and instead using Spark Standalone, you don’t need Spark history server. More on the Web Interfaces is provided in the following link. Since are using standalone mode, you should be able to access the web UI for the master and workers at ports that

how to use DoubleRDDFunctions on mllib Vector?

2015-07-07 Thread 诺铁
hi, there are some useful functions in DoubleRDDFunctions which I can use if I have an RDD[Double], e.g. mean, variance. Vector doesn't have such methods; how can I convert a Vector to RDD[Double], or, even better, can I call mean directly on a Vector?

Re: How to verify that the worker is connected to master in CDH5.4

2015-07-07 Thread ayan guha
On UI? Master: http://masterip:8080 Worker: http://workerIp:8081 On Wed, Jul 8, 2015 at 1:42 PM, Ashish Dutt ashish.du...@gmail.com wrote: Hi, I have CDH 5.4 installed on a linux server. It has 1 cluster in which spark is deployed as a history server. I am trying to connect my laptop to the

Re: How to verify that the worker is connected to master in CDH5.4

2015-07-07 Thread Ashish Dutt
Hello Guru, Many thanks for your reply. I am new to this whole thing, so pardon me for my naivety at times. I am not sure if I am using Spark standalone or Spark on YARN, because when I check the port number of Spark it shows it as 18088 and, like you have mentioned, maybe it is then Spark on YARN.

Re: How do we control output part files created by Spark job?

2015-07-07 Thread Umesh Kacha
Hi Srikant, thanks for the response. I have the following code: hiveContext.sql("insert into... ").coalesce(6) The above code does not create 6 part files; it creates around 200 small files. Please guide. Thanks. On Jul 8, 2015 4:07 AM, Srikanth srikanth...@gmail.com wrote: Did you do

Re: HiveContext throws org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

2015-07-07 Thread prosp4300
Hi bdev, Derby is the default embedded DB for the Hive metastore if you do not specify hive.metastore.uris. Please take a look at the lib directory of Hive; you can find the Derby jar there. Spark does not require Derby by default. At 2015-07-07 17:07:28, bdev buntu...@gmail.com wrote: Just

Re: Maintain Persistent Connection with Hive meta store

2015-07-07 Thread prosp4300
Each time you run the jar, a new JVM will be started; maintaining a connection between different JVMs is not the correct way to think of "each time when I run that jar it tries to make connection with hive metastore". At 2015-07-07 17:07:06, wazza rajeshkumarit8...@gmail.com wrote: Hi I am new to
