Re: Unable to set cores while submitting Spark job

2016-03-31 Thread Mich Talebzadeh
Hi Shridhar, can you check on the Spark GUI whether the number of cores shown per worker is the same as you set up? This shows under the "Cores" column. HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Unable to set cores while submitting Spark job

2016-03-31 Thread vetal king
Ted, Mich, Thanks for your replies. I ended up using sparkConf.set() and accepted cores as a parameter. But I'm still not sure why spark-submit's executor-cores or driver-cores property did not work. Setting cores within the main method seems a bit cumbersome. Thanks again, Shridhar On Wed, Mar
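A minimal sketch of the two approaches discussed in this thread: passing cores on the spark-submit command line versus setting them programmatically before the SparkContext is built. The class name, master URL and core counts are illustrative assumptions, not values from the thread.

    // Command-line approach (example values):
    //   spark-submit --class com.example.MyApp --master spark://master:7077 \
    //     --executor-cores 4 --driver-cores 2 myapp.jar
    // Programmatic approach, set before the SparkContext is created:
    import org.apache.spark.{SparkConf, SparkContext}

    object MyApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("MyApp")
          .set("spark.executor.cores", args(0)) // cores accepted as a parameter, as Shridhar describes
        val sc = new SparkContext(conf)
        // ... job logic ...
        sc.stop()
      }
    }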

How to release data frame to avoid memory leak

2016-03-31 Thread kramer2...@126.com
Hi, I have data frames created every 5 minutes. I use a dict to keep the most recent 1 hour of data frames, so only 12 data frames can be kept in the dict. New data frames come in, old data frames pop out. My question is: when I pop out an old data frame, do I have to call dataframe.unpersist to release the
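A minimal sketch of the eviction pattern being asked about, written in Scala rather than the Python hinted at by "dict"; the queue, the helper name and the window size of 12 frames follow the description above, and unpersist is called on the frame that falls out of the window.

    import scala.collection.mutable
    import org.apache.spark.sql.DataFrame

    val window = mutable.Queue.empty[DataFrame]   // most recent 12 five-minute frames

    def addFrame(df: DataFrame): Unit = {
      df.cache()                 // keep the newly arrived frame in memory
      window.enqueue(df)
      if (window.size > 12) {
        val old = window.dequeue()
        old.unpersist()          // release the cached blocks of the evicted frame
      }
    }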

SPARK-1.6 build with HIVE

2016-03-31 Thread guoqing0...@yahoo.com.hk.INVALID
Hi, I'd like to know whether Spark 1.6 only supports Hive 0.13, or whether it can be built with higher versions like 1.x? guoqing0...@yahoo.com.hk

Re: Disk Full on one Worker is leading to Job Stuck and Executor Unresponsive

2016-03-31 Thread Abhishek Anand
This is what I am getting in the executor logs 16/03/29 10:49:00 ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file

spark-shell failing but pyspark works

2016-03-31 Thread Cyril Scetbon
Hi, I'm having issues creating a StreamingContext with Scala using spark-shell. It tries to access the localhost interface and the Application Master is not running on this interface: ERROR ApplicationMaster: Failed to connect to driver at localhost:47257, retrying ... I don't have the

Re: Execution error during ALS execution in spark

2016-03-31 Thread buring
I have some suggestions you may try: 1) for the input RDD, use the persist method; this may save much running time. 2) From the UI you can see the cluster spends much time in the shuffle stage; this can be adjusted through some conf parameters such as "spark.shuffle.memoryFraction" and "spark.memory.fraction". Good luck

Apache Spark-Get All Field Names From Nested Arbitrary JSON Files

2016-03-31 Thread John Radin
Hello All- I have run into a somewhat perplexing issue that has plagued me for several months now (with haphazard workarounds). I am trying to create an Avro Schema (schema-enforced format for serializing arbitrary data, basically, as I understand it) to convert some complex JSON files (arbitrary

Re: Select per Dataset attribute (Scala) not possible? Why no Seq().as[type] for Datasets?

2016-03-31 Thread Jacek Laskowski
Hi Ted, Sure! It works with map, but not with select. Wonder if it's by design or...will soon be fixed? Thanks again for your help. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at

Re: [SQL] A bug with withColumn?

2016-03-31 Thread Jacek Laskowski
Hi, Thanks Ted. It means that it's not only possible to rename a column using withColumnRenamed, but also to replace the content of a column (in one shot) using withColumn with an existing column name. I can live with that :) Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/

Re: [SQL] A bug with withColumn?

2016-03-31 Thread Ted Yu
Looks like this is result of the following check: val shouldReplace = output.exists(f => resolver(f.name, colName)) if (shouldReplace) { where existing column, text, was replaced. On Thu, Mar 31, 2016 at 12:08 PM, Jacek Laskowski wrote: > Hi, > > Just ran into the
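A small sketch of the behaviour Ted describes, assuming a frame with an existing "text" column like the one earlier in this thread: withColumn with an existing column name replaces that column instead of adding a second one.

    scala> import org.apache.spark.sql.functions.lit
    scala> val df = Seq((0, "hello"), (1, "world")).toDF("id", "text")
    scala> df.withColumn("text", lit("replaced")).show()
    // "text" is overwritten in place (shouldReplace evaluates to true);
    // no duplicate "text" column appears in the result.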

Spark process creating and writing to a Hive ORC table

2016-03-31 Thread Ashok Kumar
Hello, How feasible is it to use Spark to extract CSV files and create and write the content to an ORC table in a Hive database? Is Parquet the best (optimum) format to write to HDFS from a Spark app? Thanks

Re: Disk Full on one Worker is leading to Job Stuck and Executor Unresponsive

2016-03-31 Thread Ted Yu
Can you show the stack trace? The log message came from DiskBlockObjectWriter#revertPartialWritesAndClose(). Unfortunately, the method doesn't throw an exception, making it a bit hard for the caller to know of the disk-full condition. On Thu, Mar 31, 2016 at 11:32 AM, Abhishek Anand

OutOfMemory with wide (289 column) dataframe

2016-03-31 Thread ludflu
I'm building a spark job against Spark 1.6.0 / EMR 4.4 in Scala. I'm attempting to concat a bunch of dataframe columns then explode them into new rows. (just using the built in concat and explode functions) Works great in my unit test. But I get out of memory issues when I run against my

Re: Problem with jackson lib running on spark

2016-03-31 Thread Marcelo Oikawa
> Please exclude jackson-databind - that was where the AnnotationMap class > comes from. > I tried as you suggested but I'm getting the same error. It seems strange because when I inspect the generated jar there is nothing related to AnnotationMap, but there is a databind there. > > On Thu, Mar 31,

[SQL] A bug with withColumn?

2016-03-31 Thread Jacek Laskowski
Hi, Just ran into the following. Is this a bug?
scala> df.join(nums, df("id") === nums("id")).withColumn("TEXT2", lit(5)).show
+---+-------+---+-----+-----+
| id|   text| id| text|TEXT2|
+---+-------+---+-----+-----+
|  0|  hello|  0|  two|    5|
|  1|swiecie|  1|three|    5|

Re: Problem with jackson lib running on spark

2016-03-31 Thread Ted Yu
Please exclude jackson-databind - that was where the AnnotationMap class comes from. On Thu, Mar 31, 2016 at 11:37 AM, Marcelo Oikawa < marcelo.oik...@webradar.com> wrote: > Hi, Alonso. > > As you can see jackson-core is provided by several libraries, try to >> exclude it from spark-core, i

Re: Problem with jackson lib running on spark

2016-03-31 Thread Marcelo Oikawa
Hi, Alonso. > As you can see jackson-core is provided by several libraries, try to exclude it from spark-core, I think the minor version is included within it. > There is no more than one jackson-core provided by spark-core. There are jackson-core and jackson-core-asl, but they are different

Disk Full on one Worker is leading to Job Stuck and Executor Unresponsive

2016-03-31 Thread Abhishek Anand
Hi, Why is it that when the disk space is full on one of the workers, the executor on that worker becomes unresponsive and the jobs on that worker fail with the exception 16/03/29 10:49:00 ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file

Re: Calling spark from a java web application.

2016-03-31 Thread Ricardo Paiva
$SPARK_HOME/conf/log4j.properties. By default it uses $SPARK_HOME/conf/log4j.properties.template. On Thu, Mar 31, 2016 at 3:28 PM, arul_anand_2000 [via Apache Spark User List] wrote: > Can you please let me know how the log4j properties were configured. I

Re: Spark master keeps running out of RAM

2016-03-31 Thread Josh Rosen
One possible cause of a standalone master OOMing is https://issues.apache.org/jira/browse/SPARK-6270. In 2.x, this will be fixed by https://issues.apache.org/jira/browse/SPARK-12299. In 1.x, one mitigation is to disable event logging. Another workaround would be to produce a patch which disables
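For reference, a minimal sketch of the event-logging mitigation Josh mentions, set per application on the SparkConf (whether turning it off is acceptable depends on whether you rely on the history server / completed-application UI):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("MyApp")                      // app name is illustrative
      .set("spark.eventLog.enabled", "false")   // stop writing event logs the master would later replay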

Spark master keeps running out of RAM

2016-03-31 Thread Dillian Murphey
Why would the spark master run out of RAM if I have too many slaves? Is this a flaw in the coding? I'm just a user of spark. The developer that set this up left the company, so I'm starting from the top here. So I noticed if I spawn lots of jobs, my spark master ends up crashing due to low

Re: Problem with jackson lib running on spark

2016-03-31 Thread Alonso Isidoro Roman
As you can see, jackson-core is provided by several libraries; try to exclude it from spark-core, I think the minor version is included within it. Use this guide to see how to do it: https://maven.apache.org/guides/introduction/introduction-to-optional-and-excludes-dependencies.html Alonso

Re: Problem with jackson lib running on spark

2016-03-31 Thread Marcelo Oikawa
Hey, Alonso. Here is the output:
[INFO] spark-processor:spark-processor-druid:jar:1.0-SNAPSHOT
[INFO] +- org.apache.spark:spark-streaming_2.10:jar:1.6.1:provided
[INFO] |  +- org.apache.spark:spark-core_2.10:jar:1.6.1:provided
[INFO] |  |  +-

Re: Problem with jackson lib running on spark

2016-03-31 Thread Alonso Isidoro Roman
Run mvn dependency:tree and print the output here; I suspect that the jackson library is included within more than one dependency. Alonso Isidoro Roman. My favourite quotes (of today): "If debugging is the process of removing software errors, then programming must be the process of

Re: Problem with jackson lib running on spark

2016-03-31 Thread Marcelo Oikawa
Hey, Ted. > 2.4.4 > Looks like Tranquility uses a different version of jackson. > How do you build your jar? I'm building a jar with dependencies using the maven assembly plugin. Below are all the jackson dependencies: [INFO]

Re: transformation - spark vs cassandra

2016-03-31 Thread Femi Anthony
Try it out on a smaller subset of data and see which gives the better performance. On Thu, Mar 31, 2016 at 12:11 PM, Arun Sethia wrote: > Thanks Imre. > > But I thought spark-cassandra driver is going to do same internally. > > On Thu, Mar 31, 2016 at 10:32 AM, Imre Nagi

Re: confusing about Spark SQL json format

2016-03-31 Thread Femi Anthony
I encountered a similar problem reading multi-line JSON files into Spark a while back, and here's an article I wrote about how to solve it: http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files/ You may find it useful. Femi On Thu, Mar 31, 2016 at 12:32 PM,

Re: Problem with jackson lib running on spark

2016-03-31 Thread Ted Yu
Spark 1.6.1 uses this version of jackson: 2.4.4 Looks like Tranquility uses different version of jackson. How do you build your jar ? Consider using maven-shade-plugin to resolve the conflict if you use maven. Cheers On Thu, Mar 31, 2016 at 9:50 AM, Marcelo Oikawa
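A sketch of what Ted's maven-shade-plugin suggestion could look like in the application's pom.xml: relocating the jackson packages bundled with the job so they cannot clash with the jackson 2.4.4 that Spark 1.6.1 ships. The shadedPattern name is an assumption about this particular project, not a required value.

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals><goal>shade</goal></goals>
          <configuration>
            <relocations>
              <relocation>
                <!-- move the bundled jackson out of the way of Spark's own copy -->
                <pattern>com.fasterxml.jackson</pattern>
                <shadedPattern>shaded.com.fasterxml.jackson</shadedPattern>
              </relocation>
            </relocations>
          </configuration>
        </execution>
      </executions>
    </plugin>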

Problem with jackson lib running on spark

2016-03-31 Thread Marcelo Oikawa
Hi, list. We are working on a spark application that sends messages to Druid. For that, we're using Tranquility core. In my local test, I'm using the "spark-1.6.1-bin-hadoop2.6" distribution and the following dependencies in my app: org.apache.spark:spark-streaming_2.10:1.6.1

Re: confusing about Spark SQL json format

2016-03-31 Thread Ross.Cramblit
You are correct that it does not take the standard JSON file format. From the Spark Docs: "Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will
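A small illustration of the point being quoted: each line of the input file must be a complete, self-contained JSON object, not one element of a pretty-printed array. The file name and the read call are the standard 1.x usage; the records reuse the sample data from elsewhere in this thread.

    // people.json -- one object per line:
    //   {"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}
    //   {"name":"Michael","address":{"city":null,"state":"California"}}
    val people = sqlContext.read.json("people.json")
    people.printSchema()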

Re: transformation - spark vs cassandra

2016-03-31 Thread Arun Sethia
Thanks Imre. But I thought the spark-cassandra driver is going to do the same internally. On Thu, Mar 31, 2016 at 10:32 AM, Imre Nagi wrote: > I think querying by cassandra query language will be better in terms of > performance if you want to pull and filter the data from

Re: Concurrent Spark jobs

2016-03-31 Thread emlyn
In case anyone else has the same problem and finds this - in my case it was fixed by increasing spark.sql.broadcastTimeout (I used 9000). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Concurrent-Spark-jobs-tp26011p26648.html Sent from the Apache Spark
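For anyone hitting the same timeout, a sketch of where that setting goes in Spark 1.x, using the value quoted in the message; the submit-time form is an alternative, not something prescribed by the thread.

    // Raise the broadcast-join timeout (in seconds) on the SQLContext:
    sqlContext.setConf("spark.sql.broadcastTimeout", "9000")
    // or at submit time: --conf spark.sql.broadcastTimeout=9000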

Re: transformation - spark vs cassandra

2016-03-31 Thread Imre Nagi
I think querying via the Cassandra query language will be better in terms of performance if you want to pull and filter the data from your db, rather than pulling all of the data and doing the filtering and transformation using a Spark data frame. On 31 Mar 2016 22:19, "asethia"

transformation - spark vs cassandra

2016-03-31 Thread asethia
Hi, I am working with Cassandra and Spark, and would like to know which gives the best performance: using a Cassandra filter based on the primary key and cluster key, vs using Spark data frame transformations/filters. For example, in Spark: val rdd = sqlContext.read.format("org.apache.spark.sql.cassandra")
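A rough sketch of the two alternatives being compared, continuing the read.format call above. The keyspace, table and column names are made-up placeholders; the general point is that a predicate on the partition/clustering key can be pushed down to Cassandra by the connector, while anything else is filtered only after the rows reach Spark.

    // (a) filter on a key column -- the connector may push the predicate down to Cassandra
    val byKey = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "events"))
      .load()
      .filter("customer_id = '42'")

    // (b) filter on a non-key column -- all rows are read, then filtered in Spark
    val all = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "events"))
      .load()
    val filtered = all.filter(all("event_count") > 10)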

RE: SPARK-13900 - Join with simple OR conditions take too long

2016-03-31 Thread Yong Zhang
I agree that there won't be a generic solution for these kind of cases. Without the CBO from Spark or Hadoop ecosystem in short future, maybe Spark DataFrame/SQL should support more hints from the end user, as in these cases, end users will be smart enough to tell the engine what is the correct

Re: Select per Dataset attribute (Scala) not possible? Why no Seq().as[type] for Datasets?

2016-03-31 Thread Ted Yu
I tried this:
scala> final case class Text(id: Int, text: String)
warning: there was one unchecked warning; re-run with -unchecked for details
defined class Text
scala> val ds = Seq(Text(0, "hello"), Text(1, "world")).toDF.as[Text]
ds: org.apache.spark.sql.Dataset[Text] = [id: int, text: string]
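Continuing Ted's snippet, a sketch of the part the thread reports as working: picking a field through map on the typed Dataset, as opposed to the select-by-attribute form being asked about.

    scala> val ds = Seq(Text(0, "hello"), Text(1, "world")).toDF.as[Text]
    scala> ds.map(_.text).collect()   // typed per-field access works through map
    // => Array(hello, world)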

Re: Restart App and consume from checkpoint using direct kafka API

2016-03-31 Thread Cody Koeninger
Long story short, no. Don't rely on checkpoints if you can't handle reprocessing some of your data. On Thu, Mar 31, 2016 at 3:02 AM, Imre Nagi wrote: > I don't know how to read the data from the checkpoint. But AFAIK and based > on my experience, I think the best thing

Re:Re: How to design the input source of spark stream

2016-03-31 Thread 李明伟
Hi Anthony, Thanks. You are right, the API will read all the files, no need to merge. At 2016-03-31 20:09:25, "Femi Anthony" wrote: Also, ssc.textFileStream(dataDir) will read all the files from a directory so as far as I can see there's no need to merge the files. Just

Re: Does Spark CSV accept a CSV String

2016-03-31 Thread Mich Talebzadeh
Well, my guess is to just pkunzip it and use bzip2 to zip it, or leave it as it is. Databricks handles *.bz2 type files, I know that. Anyway, that is the easy part :) Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-31 Thread Eugene Morozov
Joseph, Correction, there are 20k features. Is it still a lot? What number of features can be considered normal? -- Be well! Jean Morozov On Tue, Mar 29, 2016 at 10:09 PM, Joseph Bradley wrote: > First thought: 70K features is *a lot* for the MLlib implementation (and >

Re: Spark for Log Analytics

2016-03-31 Thread ashish rawat
Thanks for your replies Steve and Chris. Steve, I am creating a real-time pipeline, so I am not looking to dump data to hdfs right now. Also, since the log sources would be Nginx, Mongo and application events, it might not be possible to always route events directly from the source to flume.

Re: Spark streaming spilling all the data to disk even if memory available

2016-03-31 Thread Akhil Das
Use StorageLevel MEMORY_ONLY. Also have a look at the createDirectStream API. Most likely in your case your batch duration must be less than your processing time and the addition of delay probably blows up the memory. On Mar 31, 2016 6:13 PM, "Mayur Mohite" wrote: > We
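A sketch of the two suggestions for the Spark 1.6 streaming-kafka API, with the ZooKeeper quorum, broker list, group and topic names as placeholders: the receiver-based stream with its storage level forced to MEMORY_ONLY, and the receiver-less direct stream, which has no receiver storage level at all.

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Receiver-based stream kept in memory only (no spilling to disk):
    val stream = KafkaUtils.createStream(ssc, "zk:2181", "my-group",
      Map("my-topic" -> 1), StorageLevel.MEMORY_ONLY)

    // Direct (receiver-less) stream:
    val direct = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "broker:9092"), Set("my-topic"))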

Re: Spark streaming spilling all the data to disk even if memory available

2016-03-31 Thread Mayur Mohite
We are using KafkaUtils.createStream API to read data from kafka topics and we are using StorageLevel.MEMORY_AND_DISK_SER option while configuring kafka streams. On Wed, Mar 30, 2016 at 12:58 PM, Akhil Das wrote: > Can you elaborate more on from where you are

Re: How to design the input source of spark stream

2016-03-31 Thread Femi Anthony
Also, ssc.textFileStream(dataDir) will read all the files from a directory so as far as I can see there's no need to merge the files. Just write them to the same HDFS directory. On Thu, Mar 31, 2016 at 8:04 AM, Femi Anthony wrote: > I don't think you need to do it this way.
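A minimal sketch of that pattern: point textFileStream at a single HDFS directory and every new file that lands there becomes part of the stream. The directory path and the batch interval are placeholders, not values from the thread.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(300))          // e.g. a 5-minute batch interval
    val lines = ssc.textFileStream("hdfs:///data/incoming")   // picks up each new file in the directory
    lines.count().print()
    ssc.start()
    ssc.awaitTermination()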

Re: How to design the input source of spark stream

2016-03-31 Thread Femi Anthony
I don't think you need to do it this way. Take a look here : http://spark.apache.org/docs/latest/streaming-programming-guide.html in this section: Level of Parallelism in Data Receiving Receiving multiple data streams can therefore be achieved by creating multiple input DStreams and configuring

Select per Dataset attribute (Scala) not possible? Why no Seq().as[type] for Datasets?

2016-03-31 Thread Jacek Laskowski
Hi, I can't seem to use Dataset using case classes (or tuples) to select per field:
scala> final case class Text(id: Int, text: String)
warning: there was one unchecked warning; re-run with -unchecked for details
defined class Text
scala> val ds = Seq(Text(0, "hello"), Text(1,

Execution error during ALS execution in spark

2016-03-31 Thread pankajrawat
Hi, While building a recommendation engine using Spark MLlib (ALS) we are facing some issues during execution. Details are below. We are trying to train our model on 1.4 million sparse rating records (100,000 customers X 50,000 items). The execution DAG cycle is taking a long time and is

Re: Spark for Log Analytics

2016-03-31 Thread Chris Fregly
Oh, and I forgot to mention Kafka Streams, which has been heavily talked about the last few days at Strata here in San Jose. Streams can simplify a lot of this architecture by performing some light-to-medium-complex transformations in Kafka itself. I'm waiting anxiously for Kafka 0.10 with

Re: Spark for Log Analytics

2016-03-31 Thread Chris Fregly
this is a very common pattern, yes. note that in Netflix's case, they're currently pushing all of their logs to a Fronting Kafka + Samza Router which can route to S3 (or HDFS), ElasticSearch, and/or another Kafka Topic for further consumption by internal apps using other technologies like Spark

Execution error during ALS execution in spark

2016-03-31 Thread Pankaj Rawat
Hi, While building a recommendation engine using Spark MLlib (ALS) we are facing some issues during execution. Details are below. We are trying to train our model on 1.4 million sparse rating records (100,000 customers X 50,000 items). The execution DAG cycle is taking a long time and is

Re: Spark for Log Analytics

2016-03-31 Thread Steve Loughran
On 31 Mar 2016, at 09:37, ashish rawat wrote: Hi, I have been evaluating Spark for analysing Application and Server Logs. I believe there are some downsides of doing this: 1. No direct mechanism of collecting logs, so need to introduce other

Re: Read Parquet in Java Spark

2016-03-31 Thread Ramkumar V
Hi, Thanks for the reply. I tried this. It's returning JavaRDD<Row> instead of the JavaRDD type I need. How do I get that? Error: incompatible types: org.apache.spark.api.java.JavaRDD<Row> cannot be converted to the required org.apache.spark.api.java.JavaRDD type. *Thanks*, On Thu, Mar

Re: SPARK-13900 - Join with simple OR conditions take too long

2016-03-31 Thread Hemant Bhanawat
Hi Ashok, That's interesting. As I understand it, on tables A and B a nested loop join (that will produce m X n rows) is performed, and then each row is evaluated to see if any of the conditions is met. You are asking that Spark should instead do a BroadcastHashJoin on the equality conditions in

Re: Unable to Run Spark Streaming Job in Hadoop YARN mode

2016-03-31 Thread Ted Yu
Looking through https://spark.apache.org/docs/latest/configuration.html#spark-streaming , I don't see config specific to YARN. Can you pastebin the exception you saw ? When the job stopped, was there any error ? Thanks On Wed, Mar 30, 2016 at 10:57 PM, Soni spark

Re: Read Parquet in Java Spark

2016-03-31 Thread UMESH CHAUDHARY
From the Spark documentation: DataFrame parquetFile = sqlContext.read().parquet("people.parquet"); JavaRDD<Row> jRDD = parquetFile.javaRDD(); The javaRDD() method will convert the DF to an RDD. On Thu, Mar 31, 2016 at 2:51 PM, Ramkumar V wrote: > Hi, > > I'm trying to read parquet log

Read Parquet in Java Spark

2016-03-31 Thread Ramkumar V
Hi, I'm trying to read parquet log files in Java Spark. The parquet log files are stored in hdfs. I want to read and convert that parquet file into a JavaRDD. I was able to find the SQLContext dataframe API. How can I read it if it is SparkContext and RDD? What is the best way to read it? *Thanks*,

Re: confusing about Spark SQL json format

2016-03-31 Thread UMESH CHAUDHARY
Hi Charles, The definition of an object from www.json.org: An *object* is an unordered set of name/value pairs. An object begins with { (left brace) and ends with } (right brace). Each name is followed by : (colon) and the name/value pairs are separated by , (comma). It's pretty much the OOPS

Re: confusing about Spark SQL json format

2016-03-31 Thread charles li
Hi UMESH, I think you've misunderstood the json definition. There is only one object in a json file: for the file people.json, as below: {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}

SPARK-13900 - Join with simple OR conditions take too long

2016-03-31 Thread ashokkumar rajendran
Hi, I have filed ticket SPARK-13900. There was an initial reply from a developer but I did not get any further reply on this. How can we do multiple hash joins together for OR-condition-based joins? Could someone please guide us on how we can fix this? Regards Ashok

Re: confusing about Spark SQL json format

2016-03-31 Thread Hechem El Jed
Hello, Actually I have been through the same problem as you when I was implementing a decision tree algorithm with Spark, parsing the output to a comprehensible json format. So as you said, the correct json format is: [{ "name": "Yin", "address": { "city": "Columbus",

Re: confusing about Spark SQL json format

2016-03-31 Thread UMESH CHAUDHARY
Hi, Look at the below image, which is from json.org: [image: Inline image 1] The above image describes the object formulation of the below JSON: Object 1 => {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}} Object 2 => {"name":"Michael", "address":{"city":null, "state":"California"}} Note that

Spark for Log Analytics

2016-03-31 Thread ashish rawat
Hi, I have been evaluating Spark for analysing Application and Server Logs. I believe there are some downsides of doing this: 1. No direct mechanism of collecting logs, so we need to introduce other tools like Flume into the pipeline. 2. Need to write lots of code for parsing different patterns from

confusing about Spark SQL json format

2016-03-31 Thread charles li
As this post says, in spark we can load a json file in the way below: *post*: https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html ---

Re: Restart App and consume from checkpoint using direct kafka API

2016-03-31 Thread Imre Nagi
I don't know how to read the data from the checkpoint. But AFAIK and based on my experience, I think the best thing that you can do is to store the offset in a particular storage such as a database every time you consume the message. Then read the offset from the database every time you want to start
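A rough sketch of the approach described here for the direct Kafka API: read the offset ranges out of each processed RDD, persist them yourself, and feed them back into createDirectStream on restart. The saveOffsets/loadOffsets helpers stand in for whatever store (a database table, ZooKeeper, etc.) you choose and are not part of any Spark API; ssc and kafkaParams are assumed to be defined as usual.

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

    def saveOffsets(ranges: Array[OffsetRange]): Unit = ???       // write to your own store
    def loadOffsets(): Map[TopicAndPartition, Long] = ???         // read back on restart

    val fromOffsets = loadOffsets()
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
      (String, String)](ssc, kafkaParams, fromOffsets,
      (m: MessageAndMetadata[String, String]) => (m.key, m.message))

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process rdd ...
      saveOffsets(ranges)   // persist offsets only after the batch succeeds
    }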

Restart App and consume from checkpoint using direct kafka API

2016-03-31 Thread vimal dinakaran
Hi, In the blog https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md It is mentioned that enabling checkpoint works as long as the app jar is unchanged. If I want to upgrade the jar with the latest code and consume from kafka where it was stopped , how to do that ? Is there a

Re: No active SparkContext

2016-03-31 Thread Max Schmidt
Just to mark this question closed - we experienced an OOM exception on the Master, which we didn't see on the Driver, but which made it crash. On 24.03.2016 at 09:54, Max Schmidt wrote: > Hi there, > > we're using the java-api (1.6.0) with a ScheduledExecutor that > continuously executes a SparkJob

Re: Exposing dataframe via thrift server

2016-03-31 Thread Marco Colombo
Is the context that is registering the temp table still active? On Thursday 31 March 2016, ram kumar wrote: > Hi, > > I started the thrift server > cd $SPARK_HOME > ./sbin/start-thriftserver.sh > > Then, the jdbc client > $ ./bin/beeline > Beeline version 1.5.2 by Apache Hive >