Re: spark-submit stuck and no output in console

2015-11-16 Thread Kayode Odeyemi
Spark 1.5.1. The fact is that there's no stack trace, and no output from that command at all to the console. This is all I get:
hadoop-user@yks-hadoop-m01:/usr/local/spark/bin$ tail -1 /tmp/spark-profile-job.log
nohup: ignoring input
/usr/local/spark/bin/spark-class: line 76: 29516 Killed

Re: spark-submit stuck and no output in console

2015-11-16 Thread Ted Yu
Which release of Spark are you using? Can you take a stack trace and pastebin it? Thanks. On Mon, Nov 16, 2015 at 5:50 AM, Kayode Odeyemi wrote: > ./spark-submit --class com.migration.UpdateProfiles --executor-memory 8g > ~/migration-profiles-0.1-SNAPSHOT.jar > > is stuck

[Spark-Avro] Question related to the Avro data generated by Spark-Avro

2015-11-16 Thread java8964
Hi, I have one question related to Spark-Avro; not sure if this is the best place to ask. I have the following Scala case classes, populated with data in the Spark application, and I tried to save them in Avro format in HDFS: case class Claim ( ..) case class Coupon ( account_id: Long

Re: Join and HashPartitioner question

2015-11-16 Thread Erwan ALLAIN
You may need to persist r1 after the partitionBy call; the second join will be more efficient. On Mon, Nov 16, 2015 at 2:48 PM, Rishi Mishra wrote: > AFAIK and can see in the code both of them should behave same. > > On Sat, Nov 14, 2015 at 2:10 AM, Alexander Pivovarov
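To make the suggestion concrete, here is a minimal sketch of that pattern; the RDD contents and partition count are illustrative, not from the original thread:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    // Illustrative stand-ins for the r1/r2 pair RDDs from the thread.
    val r1: RDD[(String, String)] = sc.parallelize(Seq(("a", "1"), ("b", "2")))
    val r2: RDD[(String, String)] = sc.parallelize(Seq(("a", "x"), ("b", "y")))

    // Partition r1 once and persist it, so later joins against it reuse the
    // existing partitioning instead of reshuffling r1 every time.
    val r1Partitioned = r1.partitionBy(new HashPartitioner(8)).persist()

    val joined1 = r1Partitioned.join(r2) // shuffles r2 to r1's partitioner
    val joined2 = r1Partitioned.join(r2) // reuses the cached, partitioned r1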

Re: spark-submit stuck and no output in console

2015-11-16 Thread Jonathan Kelly
He means for you to use jstack to obtain a stacktrace of all of the threads. Or are you saying that the Java process never even starts? On Mon, Nov 16, 2015 at 7:48 AM, Kayode Odeyemi wrote: > Spark 1.5.1 > > The fact is that there's no stack trace. No output from that

Re: Spark Expand Cluster

2015-11-16 Thread Dinesh Ranganathan
Hi Sab, I did not specify the number of executors when I submitted the Spark application. I was under the impression that Spark looks at the cluster and automatically figures out the number of executors it can use based on the cluster size. Is this what you call dynamic allocation? I am a Spark newbie, so

Re: Spark Expand Cluster

2015-11-16 Thread Sabarish Sasidharan
Spark will use the number of executors you specify in spark-submit. Are you saying that Spark is not able to use more executors after you modify it in spark-submit? Are you using dynamic allocation? Regards Sab On Mon, Nov 16, 2015 at 5:54 PM, dineshranganathan < dineshranganat...@gmail.com>

[POWERED BY] Please add our organization

2015-11-16 Thread Adrien Mogenet
Name: Content Square URL: http://www.contentsquare.com Description: We use Spark to regularly read raw data, convert it into Parquet, and process it to create advanced analytics dashboards: aggregation, sampling, statistics computations, anomaly detection, machine learning. -- *Adrien

Spark Expand Cluster

2015-11-16 Thread dineshranganathan
I have my Spark application deployed on AWS EMR in yarn-cluster mode. When I increase the capacity of my cluster by adding more Core instances on AWS, I don't see Spark picking up the new instances dynamically. Is there anything I can do to tell Spark to pick up the newly added boxes? Dan --

Re: DynamoDB Connector?

2015-11-16 Thread Nick Pentreath
See this thread for some info: http://apache-spark-user-list.1001560.n3.nabble.com/DynamoDB-input-source-td8814.html I don't think the situation has changed that much - if you're using Spark on EMR, then I think the InputFormat is available in a JAR (though I haven't tested that). Otherwise

How 'select name,age from TBL_STUDENT where age = 37' is optimized when caching it

2015-11-16 Thread Todd
Hi, when I cache the DataFrame and run the query
val df = sqlContext.sql("select name,age from TBL_STUDENT where age = 37")
df.cache()
df.show
println(df.queryExecution)
I got the following execution plan. From the optimized logical plan, I can see the whole analyzed logical

spark-submit stuck and no output in console

2015-11-16 Thread Kayode Odeyemi
./spark-submit --class com.migration.UpdateProfiles --executor-memory 8g ~/migration-profiles-0.1-SNAPSHOT.jar is stuck and outputs nothing to the console. What could be the cause of this? Current max heap size is 1.75g and it's only using 1g.

Re: Join and HashPartitioner question

2015-11-16 Thread Rishi Mishra
AFAIK, and as I can see in the code, both of them should behave the same. On Sat, Nov 14, 2015 at 2:10 AM, Alexander Pivovarov wrote: > Hi Everyone > > Is there any difference in performance btw the following two joins? > > > val r1: RDD[(String, String)] = ??? > val r2: RDD[(String,

Re: bin/pyspark SparkContext is missing?

2015-11-16 Thread Andy Davidson
Thanks, Andy. From: Davies Liu Date: Friday, November 13, 2015 at 3:42 PM To: Andrew Davidson Cc: "user @spark" Subject: Re: bin/pyspark SparkContext is missing? > You forgot to create a SparkContext instance: >

Spark Powered By Page

2015-11-16 Thread Alex Rovner
I would like to list our organization on the Powered by Page. Company: Magnetic Description: We are leveraging Spark Core, Streaming and YARN to process our massive datasets. Alex Rovner, Director, Data Engineering, o: 646.759.0052

Spark SQL UDAF works fine locally, OutOfMemory on YARN

2015-11-16 Thread Alex Nastetsky
Hi, I am using Spark 1.5.1. I have a Spark SQL UDAF that works fine on a tiny dataset (13 narrow rows) in local mode, but runs out of memory on YARN about half the time (OutOfMemory: Java Heap Space). The rest of the time, it works on YARN. Note that in all instances, the input data is the

Re: How 'select name,age from TBL_STUDENT where age = 37' is optimized when caching it

2015-11-16 Thread Xiao Li
Your dataframe is cached. Thus, your plan is stored as an InMemoryRelation. You can read the logic in CacheManager.scala. Good luck, Xiao Li 2015-11-16 6:35 GMT-08:00 Todd : > Hi, > > When I cache the dataframe and run the query, > > val df = sqlContext.sql("select
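For illustration, a small sketch (reusing the table and column names from the question) of where the cached plan shows up; after cache() and a materializing action, the plans printed below should reference the in-memory relation instead of the source table:

    val df = sqlContext.sql("select name,age from TBL_STUDENT where age = 37")
    df.cache()   // registers the plan with the CacheManager
    df.show()    // materializes the cache
    // Both plans now read from the cached in-memory relation
    // rather than re-scanning TBL_STUDENT.
    println(df.queryExecution.optimizedPlan)
    println(df.queryExecution.executedPlan)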

Re: spark-submit stuck and no output in console

2015-11-16 Thread Kayode Odeyemi
> Or are you saying that the Java process never even starts?
Exactly. Here's what I got back from jstack, as expected:
hadoop-user@yks-hadoop-m01:/usr/local/spark/bin$ jstack 31316
31316: Unable to open socket file: target process not responding or HotSpot VM not loaded
The -F option can be

Re: send transformed RDD to s3 from slaves

2015-11-16 Thread Walrus theCat
Update: You can now answer this on stackoverflow for 100 bounty: http://stackoverflow.com/questions/33704073/how-to-send-transformed-data-from-partitions-to-s3 On Fri, Nov 13, 2015 at 4:56 PM, Walrus theCat wrote: > Hi, > > I have an RDD which crashes the driver when

[SPARK STREAMING] Questions regarding foreachPartition

2015-11-16 Thread Nipun Arora
Hi, I wanted to understand the foreachPartition logic. In the code below, I am assuming the iterator is executing in a distributed fashion. 1. Assuming I have a stream which has timestamp data which is sorted, will the string iterator in foreachPartition process each line in order? 2. Assuming I have

Parallelizing operations using Spark

2015-11-16 Thread Susheel Kumar
Hello Spark Users, this is my first email to the Spark mailing list, and I am looking forward to it. I have been working on Solr and in the past have used Java thread pooling to parallelize Solr indexing with SolrJ. Now I am again working on indexing data, this time from JSON files (in the hundreds of thousands), and before I
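As a rough sketch of one way to parallelize that kind of indexing with Spark (the input path, batch size, and the commented-out client calls are placeholders, not a recommendation from this thread):

    // Read the JSON files in parallel; each file becomes rows of a DataFrame.
    val docs = sqlContext.read.json("hdfs:///data/json/*.json") // hypothetical path

    docs.toJSON.foreachPartition { iter =>
      // Create one indexing client per partition (e.g. a SolrJ client),
      // reuse it for the whole partition, and close it afterwards.
      // val client = ... // placeholder for the SolrJ client setup
      iter.grouped(500).foreach { batch =>
        // client.add(...); client.commit()  // placeholder indexing calls
        println(s"would index ${batch.size} documents")
      }
      // client.close()
    }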

Re: YARN Labels

2015-11-16 Thread Ted Yu
There is no such configuration parameter for selecting which nodes the application master is running on. Cheers On Mon, Nov 16, 2015 at 12:52 PM, Alex Rovner wrote: > I was wondering if there is analogues configuration parameter to >

Re: [SPARK STREAMING] Questions regarding foreachPartition

2015-11-16 Thread Cody Koeninger
Ordering would be on a per-partition basis, not global ordering. You typically want to acquire resources inside the foreachPartition closure, just before handling the iterator. http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd On Mon, Nov
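A small sketch of that design pattern from the linked guide, acquiring the resource once per partition on the executor (the stream name and connection helper are placeholders):

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        // Acquire the resource on the executor, once per partition.
        val connection = createNewConnection() // placeholder helper
        // Records are processed in the order they appear within this
        // partition; there is no global ordering across partitions.
        partitionOfRecords.foreach(record => connection.send(record))
        connection.close()
      }
    }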

how can evenly distribute my records in all partition

2015-11-16 Thread prateek arora
Hi, I have an RDD with 30 records (key/value pairs) and am running 30 executors. I want to repartition this RDD into 30 partitions so that every partition gets one record and is assigned to one executor. When I used rdd.repartition(30), it repartitioned my RDD into 30 partitions, but some partitions got 2 records,

YARN Labels

2015-11-16 Thread Alex Rovner
I was wondering if there is an analogous configuration parameter to "spark.yarn.executor.nodeLabelExpression" which restricts which nodes the application master is running on. One of our clusters runs on AWS with a portion of the nodes being spot nodes. We would like to force the application master

Re: Spark Implementation of XGBoost

2015-11-16 Thread Joseph Bradley
One comment about """ 1) I agree the sorting method you suggested is a very efficient way to handle the unordered categorical variables in binary classification and regression. I propose we have a Spark ML Transformer to do the sorting and encoding, bringing the benefits to many tree based

Re: Hive on Spark orc file empty

2015-11-16 Thread Deepak Sharma
Sai, I am a bit confused here. How are you using write with results? I am using Spark 1.4.1, and when I use write, it complains about write not being a member of DataFrame: error: value write is not a member of org.apache.spark.sql.DataFrame. Thanks Deepak On Mon, Nov 16, 2015 at 4:10 PM, 张炜

Re: YARN Labels

2015-11-16 Thread Ted Yu
Wangda, a YARN committer, told me that support for selecting which nodes the application master runs on is integrated into the upcoming Hadoop 2.8.0 release. Stay tuned. On Mon, Nov 16, 2015 at 1:36 PM, Ted Yu wrote: > There is no such configuration parameter for

Re: YARN Labels

2015-11-16 Thread Saisai Shao
Node labels for the AM are not yet supported in Spark; currently only executors are supported. On Tue, Nov 17, 2015 at 7:57 AM, Ted Yu wrote: > Wangda, YARN committer, told me that support for selecting which nodes the > application master is running on is integrated to the
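For reference, a sketch of the executor-side setting that does exist today (the label name is illustrative); an equivalent knob for the application master is what this thread says is not yet available:

    import org.apache.spark.{SparkConf, SparkContext}

    // Restrict executors to YARN nodes carrying a given label, e.g. on-demand
    // instances rather than spot nodes. "ondemand" is an illustrative label.
    // (The same property can equally be passed with --conf on spark-submit.)
    val conf = new SparkConf()
      .setAppName("labelled-executors")
      .set("spark.yarn.executor.nodeLabelExpression", "ondemand")

    val sc = new SparkContext(conf)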

Re: Conf Settings in Mesos

2015-11-16 Thread Jo Voordeckers
I have run into a related issue, I think: args passed via spark-submit to my cluster dispatcher get lost in translation when launching the driver from Mesos. I'm suggesting this patch: https://github.com/jayv/spark/commit/b2025ddc1d565d1cc3036200fc3b3046578f4b02 - Jo Voordeckers On Thu, Nov 12,

Stage retry limit

2015-11-16 Thread pnpritchard
In my app, I see a condition where a stage fails and Spark retries it endlessly. I see the configuration for task retry limit (spark.task.maxFailures), but is there a configuration for limiting the number of stage retries?

Spark Job is getting killed after certain hours

2015-11-16 Thread Nikhil Gs
Hello Team, below is the error we are facing in our cluster 14 hours after starting the spark-submit job. We are not able to understand the issue and why it hits the error below after a certain time. If any of you have faced the same scenario or have any idea, then please guide us. To

Re: Spark-shell connecting to Mesos stuck at sched.cpp

2015-11-16 Thread Jo Voordeckers
I've seen this issue when the mesos cluster couldn't figure out my IP address correctly, have you tried setting the ENV var with your IP address when launching spark or mesos cluster dispatcher like: LIBPROCESS_IP="172.16.0.180" - Jo Voordeckers On Sun, Nov 15, 2015 at 6:59 PM, Jong Wook Kim

Mesos cluster dispatcher doesn't respect most args from the submit req

2015-11-16 Thread Jo Voordeckers
Hi all, I'm running the Mesos cluster dispatcher; however, when I submit jobs, things like JVM args, classpath order and UI port aren't added to the command line executed by the Mesos scheduler. In fact, it only cares about the class, jar and num cores/mem.

Re: how can evenly distribute my records in all partition

2015-11-16 Thread Sabarish Sasidharan
You can write your own custom partitioner to achieve this. Regards, Sab On 17-Nov-2015 1:11 am, "prateek arora" wrote: > Hi > > I have a RDD with 30 record ( Key/value pair ) and running 30 executor . i > want to reparation this RDD in to 30 partition so every
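A minimal sketch of such a partitioner, assuming the keys can be mapped directly to distinct partition ids (here, integer keys 0 through 29):

    import org.apache.spark.Partitioner

    // Sends each record to the partition matching its integer key, so 30
    // distinct keys spread exactly one record per partition.
    class ExactPartitioner(partitions: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int = key.asInstanceOf[Int] % partitions
    }

    val data = sc.parallelize(0 until 30).map(i => (i, s"record-$i"))
    val evenlySpread = data.partitionBy(new ExactPartitioner(30))
    // Each of the 30 partitions now holds exactly one record.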

Re: Spark Job is getting killed after certain hours

2015-11-16 Thread Vipul Rai
Hi Nikhil, It seems you have Kerberos enabled cluster and it is unable to authenticate using the ticket. Please check the Kerberos settings, it could also be because of Kerberos version mismatch on nodes. Thanks, Vipul On Tue 17 Nov, 2015 07:31 Nikhil Gs wrote: >

Re: Spark Job is getting killed after certain hours

2015-11-16 Thread Ilya Ganelin
Your Kerberos cert is likely expiring. Check your expiration settings. -Ilya Ganelin On Mon, Nov 16, 2015 at 9:20 PM, Vipul Rai wrote: > Hi Nikhil, > It seems you have Kerberos enabled cluster and it is unable to > authenticate using the ticket. > Please check the

Re: Hive on Spark orc file empty

2015-11-16 Thread 张炜
Hi Deepak and all, write() is a function of DataFrame; please check https://spark.apache.org/docs/1.5.1/api/java/org/apache/spark/sql/DataFrame.html , the last one is write(). The problem is that the write to HDFS is successful, but results.write.format("orc").save("yahoo_stocks_orc") produces an empty folder. Could

Re: No spark examples jar in maven repository after 1.1.1 ?

2015-11-16 Thread Sean Owen
I think it's because they're not a library; they're example code, not something you build an app on. On Mon, Nov 16, 2015 at 9:27 AM, Jeff Zhang wrote: > I don't find spark examples jar in maven repository after 1.1.1. Any reason > for that ? > >

How to enable MetricsServlet sink in Spark 1.5.0?

2015-11-16 Thread ihavethepotential
Hi all, I am trying to get the metrics using the MetricsServlet sink (which I guess is enabled by default), as mentioned in the Spark documentation: "5. MetricsServlet is added by default as a sink in master, worker and client # driver, you can send http request "/metrics/json" to get a snapshot of

Size exceeds Integer.MAX_VALUE on EMR 4.0.0 Spark 1.4.1

2015-11-16 Thread Zhang, Jingyu
I am using spark-csv to save files to S3, and it throws "Size exceeds Integer.MAX_VALUE". Please let me know how to fix it. Thanks.
df.write()
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("s3://newcars.csv");
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at

RE: Size exceeds Integer.MAX_VALUE on EMR 4.0.0 Spark 1.4.1

2015-11-16 Thread Ewan Leith
How big do you expect the file to be? Spark has issues with single blocks over 2GB (see https://issues.apache.org/jira/browse/SPARK-1476 and https://issues.apache.org/jira/browse/SPARK-6235 for example) If you don’t know, try running df.repartition(100).write.format… to get an idea of how
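As a sketch of that suggestion, with an illustrative partition count and output path:

    // Splitting the output across more partitions keeps each written block
    // well under the 2GB limit; 100 is just a starting point to gauge sizes.
    df.repartition(100)
      .write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("s3://some-bucket/newcars-csv/") // illustrative output path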

Hive on Spark orc file empty

2015-11-16 Thread 张炜
Dear all, I am following this article to try Hive on Spark: http://hortonworks.com/hadoop-tutorial/using-hive-with-orc-from-apache-spark/ My environment: Hive 1.2.1, Spark 1.5.1. In a nutshell, I ran spark-shell and created a Hive table: hiveContext.sql("create table yahoo_orc_table (date STRING,

Re: No spark examples jar in maven repository after 1.1.1 ?

2015-11-16 Thread Jeff Zhang
But it may be useful for users to check the example source code in an IDE just by adding it as a Maven dependency. Otherwise users have to either download the source code or check it on GitHub. On Mon, Nov 16, 2015 at 5:32 PM, Sean Owen wrote: > I think because they're not a

Re: How to enable MetricsServlet sink in Spark 1.5.0?

2015-11-16 Thread rakesh rakshit
Hi Saisai, As I mentioned, I am getting very little information using the /metrics/json URI, and *also the /metrics/master/json and /metrics/applications/json URIs do not seem to be working*. Please verify the following at your end. The metrics dumped using /metrics/json are as follows: {

No spark examples jar in maven repository after 1.1.1 ?

2015-11-16 Thread Jeff Zhang
I don't find spark examples jar in maven repository after 1.1.1. Any reason for that ? http://mvnrepository.com/artifact/org.apache.spark/spark-examples_2.10 -- Best Regards Jeff Zhang

Re: How to enable MetricsServlet sink in Spark 1.5.0?

2015-11-16 Thread Saisai Shao
It should work. I tested in my local environment with "curl http://localhost:4040/metrics/json/" and the metrics were dumped. For cluster metrics, you have to change your base URL to point to the cluster manager. Thanks Jerry On Mon, Nov 16, 2015 at 5:42 PM, ihavethepotential <

Re: Size exceeds Integer.MAX_VALUE (SparkSQL$TreeNodeException: sort, tree) on EMR 4.0.0 Spark 1.4.1

2015-11-16 Thread Zhang, Jingyu
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: sort, tree:
Sort [net_site#50 ASC,device#6 ASC], true
 Exchange (RangePartitioning 200)
  Project [net_site#50,device#6,total_count#105L,adblock_count#106L,noanalytics_count#107L,unique_nk_count#109L]
   HashOuterJoin

ISDATE Function

2015-11-16 Thread Ravisankar Mani
Hi Everyone, In MSSQL Server, the "ISDATE()" function is used to find out whether the current column values are dates or not. Is it possible to check whether current column values are dates or not in Spark? Regards, Ravi
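Spark SQL has no built-in ISDATE as far as I know, but one possible workaround, sketched below, is a UDF that attempts to parse the value (the date pattern, UDF name, and the table/column in the commented query are illustrative):

    import java.text.SimpleDateFormat

    // Returns true when the string parses as a date in the given pattern.
    sqlContext.udf.register("isdate", (s: String) => {
      val fmt = new SimpleDateFormat("yyyy-MM-dd")
      fmt.setLenient(false)
      try { fmt.parse(s); true } catch { case _: Exception => false }
    })

    // sqlContext.sql("SELECT col, isdate(col) FROM some_table") // hypothetical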

Avro RDD to DataFrame

2015-11-16 Thread Deenar Toraskar
Hi, the spark-avro module supports creation of a DataFrame from Avro files. How can I convert an RDD of Avro objects that I get via Spark Streaming into a DataFrame? val avroStream = KafkaUtils.createDirectStream[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](..)
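One possible approach, sketched under the assumption that the record's fields are known up front: pull the GenericRecord out of each AvroKey and map its fields into Rows against an explicit schema (the field names and types below are placeholders for whatever the Avro schema actually contains):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, LongType, StringType}

    // Placeholder schema; replace with the fields of the real Avro record.
    val schema = StructType(Seq(
      StructField("account_id", LongType, nullable = true),
      StructField("name", StringType, nullable = true)))

    avroStream.foreachRDD { rdd =>
      val rows = rdd.map { case (avroKey, _) =>
        val record = avroKey.datum() // the underlying GenericRecord
        Row(record.get("account_id").asInstanceOf[Long],
            Option(record.get("name")).map(_.toString).orNull)
      }
      val df = sqlContext.createDataFrame(rows, schema)
      df.show()
    }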

Reading non UTF-8 files via spark streaming

2015-11-16 Thread tarek_abouzeid
Hi, i am trying to read files which are ISO-8859-6 encoded via spark streaming, but the default encoding for " ssc.textFileStream " is UTF-8 , so i don't get the data properly , so is there a way change the default encoding for textFileStream , or a way to read the file's bytes then i can