Why doesn't the --conf parameter work in yarn-cluster mode (but works in yarn-client and local)?

2015-03-23 Thread Emre Sevinc
Hello, According to Spark Documentation at https://spark.apache.org/docs/1.2.1/submitting-applications.html : --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (as shown). And indeed, when I use that parameter, in my

Re: Saving Dstream into a single file

2015-03-23 Thread Dean Wampler
You can use the coalesce method to reduce the number of partitions. You can reduce to one if the data is not too big. Then write the output. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com
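
A minimal sketch of that approach in Scala, assuming a DStream[String] called lines and an output directory of your choosing (both names are made up here):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.Time
    import org.apache.spark.streaming.dstream.DStream

    def writeBatchesAsSingleFiles(lines: DStream[String], outputDir: String): Unit = {
      lines.foreachRDD { (rdd: RDD[String], time: Time) =>
        // coalesce(1) collapses the batch to one partition, so each batch lands in a single part file
        rdd.coalesce(1).saveAsTextFile(outputDir + "/batch-" + time.milliseconds)
      }
    }

Each batch then lands in its own directory with one part file; coalesce(1) is only sensible while a batch comfortably fits on a single executor.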

[no subject]

2015-03-23 Thread Udbhav Agarwal
Hi, I am querying HBase via Spark SQL with the Java APIs. Step 1: create JavaPairRdd, then JavaRdd, then JavaSchemaRdd.applySchema objects. Step 2: sqlContext.sql(sql query). If I am updating my HBase database between these two steps (by hbase shell in some other console), the query in step two is not


Re: registerTempTable is not a member of RDD on spark 1.2?

2015-03-23 Thread Dean Wampler
In 1.2 it's a member of SchemaRDD and it becomes available on RDD (through the type class mechanism) when you add a SQLContext, like so: val sqlContext = new SQLContext(sc); import sqlContext._ In 1.3, the method has moved to the new DataFrame type. Dean Wampler, Ph.D. Author: Programming Scala,
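
A minimal spark-shell sketch of this for 1.2, where sc is the shell's SparkContext; the Person case class, data and table name are made up for illustration:

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)           // hypothetical schema, just for illustration

    val sqlContext = new SQLContext(sc)
    import sqlContext._                                  // brings the implicit RDD -> SchemaRDD conversion into scope

    val people = sc.parallelize(Seq(Person("Ann", 34), Person("Bob", 27)))
    people.registerTempTable("people")                   // now compiles, via the implicit conversion
    sqlContext.sql("SELECT name FROM people WHERE age > 30").collect()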

registerTempTable is not a member of RDD on spark 1.2?

2015-03-23 Thread IT CTO
Hi, I am running Spark. When I use sc.version I get 1.2, but when I call registerTempTable(MyTable) I get an error saying registerTempTable is not a member of RDD. Why? -- Eran | CTO

Spark error NoClassDefFoundError: org/apache/hadoop/mapred/InputSplit

2015-03-23 Thread , Roy
Hi, I am using CDH 5.3.2 packages installation through Cloudera Manager 5.3.2 I am trying to run one spark job with following command PYTHONPATH=~/code/utils/ spark-submit --master yarn --executor-memory 3G --num-executors 30 --driver-memory 2G --executor-cores 2 --name=analytics

Re: JAVA_HOME problem with upgrade to 1.3.0

2015-03-23 Thread Williams, Ken
From: Williams, Ken ken.willi...@windlogics.com Date: Thursday, March 19, 2015 at 10:59 AM To: Spark list user@spark.apache.org Subject: JAVA_HOME problem with upgrade to 1.3.0 […] Finally, I go and check the YARN

Re: join two DataFrames, same column name

2015-03-23 Thread Eric Friedman
You can include * and a column alias in the same select clause: var df1 = sqlContext.sql("select *, column_id AS table1_id from table1") FYI, this does not ultimately work, as the * still includes column_id and you cannot have two columns of that name in the joined DataFrame. So I ended up
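
One hedged alternative, assuming Spark 1.3 and the placeholder names from this thread (table1, table2, column_id): rename the key on one side before joining, so the result carries table1_id and column_id instead of two columns named column_id.

    val df1 = sqlContext.table("table1").withColumnRenamed("column_id", "table1_id")
    val df2 = sqlContext.table("table2")
    // join on the renamed key; only one side keeps the original column name
    val joined = df1.join(df2, df1("table1_id") === df2("column_id"))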

Re: Spark 1.2. loses often all executors

2015-03-23 Thread mrm
Hi, I have received three replies to my question on my personal e-mail, why don't they also show up on the mailing list? I would like to reply to the 3 users through a thread. Thanks, Maria

Re: EC2 cluster created by spark using old HDFS 1.0

2015-03-23 Thread Akhil Das
That's a hadoop version incompatibility issue, you need to make sure everything runs on the same version. Thanks Best Regards On Sat, Mar 21, 2015 at 1:24 AM, morfious902002 anubha...@gmail.com wrote: Hi, I created a cluster using spark-ec2 script. But it installs HDFS version 1.0. I would

Spark UI tunneling

2015-03-23 Thread sergunok
Is there a way to tunnel the Spark UI? I tried to tunnel client-node:4040 but my browser was redirected from localhost to some cluster-locally visible domain name. Maybe there is some startup option to make the Spark UI fully accessible through a single endpoint (address:port)? Serg.

Re: Spark Sql with python udf fail

2015-03-23 Thread lonely Feb
I caught exceptions in the python UDF code, flushed exceptions into a single file, and made sure the column number of the output lines is the same as the SQL schema. Something interesting is that my output line of the UDF code is just 10 columns, and the exception above is

RDD storage in spark streaming

2015-03-23 Thread abhi
Hi, I have a simple question about creating RDDs. When an RDD is created in Spark Streaming for a particular time window, where does the RDD get stored? 1. Does it get stored at the driver machine, or on all the machines in the cluster? 2. Does the data get stored in memory

Re: RDD storage in spark streaming

2015-03-23 Thread Jeffrey Jedele
Hey Abhi, many of StreamingContext's methods to create input streams take a StorageLevel parameter to configure this behavior. RDD partitions are generally stored in the in-memory cache of worker nodes I think. You can also configure replication and spilling to disk if needed. Regards, Jeff
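
As a concrete illustration of that parameter, a sketch with a socket receiver and an explicit StorageLevel; the host and port are placeholders, and other receivers accept the same argument:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    val lines = ssc.socketTextStream(
      "stream-host", 9999,                          // hypothetical endpoint
      StorageLevel.MEMORY_AND_DISK_SER_2)           // serialized, spill to disk allowed, replicated to 2 workers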

Re: join two DataFrames, same column name

2015-03-23 Thread Eric Friedman
Michael, thank you for the workaround and for letting me know of the upcoming enhancements, both of which sound appealing. On Sun, Mar 22, 2015 at 1:25 PM, Michael Armbrust mich...@databricks.com wrote: You can include * and a column alias in the same select clause var df1 =

Re: registerTempTable is not a member of RDD on spark 1.2?

2015-03-23 Thread IT CTO
Thanks. I am new to the environment and running Cloudera CDH5.3 with Spark in it. Apparently when running this command in spark-shell: val sqlContext = new SQLContext(sc), it fails with 'not found: type SQLContext'. Any idea why? On Mon, Mar 23, 2015 at 3:05 PM, Dean Wampler

Re: registerTempTable is not a member of RDD on spark 1.2?

2015-03-23 Thread Ted Yu
Have you tried adding the following ? import org.apache.spark.sql.SQLContext Cheers On Mon, Mar 23, 2015 at 6:45 AM, IT CTO goi@gmail.com wrote: Thanks. I am new to the environment and running cloudera CDH5.3 with spark in it. apparently when running in spark-shell this command val

Spark RDD mapped to Hbase to be updateable

2015-03-23 Thread Siddharth Ubale
Hi, We have a JavaRDD mapped to an HBase table, and when we query the HBase table using the Spark SQL API we can access the data. However, when we update the HBase table while the Spark SQL SparkConf is initialised, we cannot see the updated data. Is there any way we can have the RDD mapped to HBase

RE: Spark SQL udf(ScalaUdf) is very slow

2015-03-23 Thread Cheng, Hao
This is a very interesting issue; the root reason for the lower performance is probably that, for Scala UDFs, Spark SQL converts the data from the internal representation to the Scala representation via Scala reflection, recursively. Can you create a Jira issue for tracking this? I can start to work on

Re: Spark 1.2. loses often all executors

2015-03-23 Thread Ted Yu
In this thread: http://search-hadoop.com/m/JW1q5DM69G I only saw two replies. Maybe some people forgot to use 'Reply to All' ? Cheers On Mon, Mar 23, 2015 at 8:19 AM, mrm ma...@skimlinks.com wrote: Hi, I have received three replies to my question on my personal e-mail, why don't they also

Is yarn-standalone mode deprecated?

2015-03-23 Thread nitinkak001
Is yarn-standalone mode deprecated in Spark now? The reason I am asking is that while I can find it in the 0.9.0 documentation (https://spark.apache.org/docs/0.9.0/running-on-yarn.html), I am not able to find it in 1.2.0. I am using this mode to run Spark jobs from Oozie as a java action.

Re: How to check that a dataset is sorted after it has been written out?

2015-03-23 Thread Sean Owen
Data is not (necessarily) sorted when read from disk, no. A file might have many blocks even, and while a block yields a partition in general, the order in which those partitions appear in the RDD is not defined. This is why you'd sort if you need the data sorted. I think you could conceivably
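
One hedged way to spot-check ordering after reading the data back, assuming the keys come back as an RDD[Long] (adapt the element type and comparison to your records): verify each partition is internally sorted and that partition boundaries are non-decreasing.

    import org.apache.spark.rdd.RDD

    def looksSorted(rdd: RDD[Long]): Boolean = {
      // (partition index, first value, last value, internally sorted?)
      val stats = rdd.mapPartitionsWithIndex { (idx, iter) =>
        if (iter.isEmpty) Iterator.empty
        else {
          val values = iter.toArray                    // fine for modest partitions; stream it for huge ones
          val inOrder = values.sliding(2).forall(w => w.length < 2 || w(0) <= w(1))
          Iterator((idx, values.head, values.last, inOrder))
        }
      }.collect().sortBy(_._1)

      stats.forall(_._4) &&
        stats.sliding(2).forall(w => w.length < 2 || w(0)._3 <= w(1)._2)  // last of part i <= first of part i+1
    }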

Re: Spark sql thrift server slower than hive

2015-03-23 Thread Arush Kharbanda
A basic change needed by Spark is setting the executor memory, which defaults to 512MB. On Mon, Mar 23, 2015 at 10:16 AM, Denny Lee denny.g@gmail.com wrote: How are you running your spark instance out of curiosity? Via YARN or standalone mode? When connecting Spark thriftserver

Re: How to use DataFrame with MySQL

2015-03-23 Thread gavin zhang
OK,I found what the problem is: It couldn't work with mysql-connector-5.0.8. I updated the connector version to 5.1.34 and it worked. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-DataFrame-with-MySQL-tp22178p22182.html Sent from the Apache

Re: Is it possible to use json4s 3.2.11 with Spark 1.3.0?

2015-03-23 Thread Alexey Zinoviev
Thanks Ted, I'll try, hope there's no transitive dependencies on 3.2.10. On Tue, Mar 24, 2015 at 4:21 AM, Ted Yu yuzhih...@gmail.com wrote: Looking at core/pom.xml: <dependency> <groupId>org.json4s</groupId> <artifactId>json4s-jackson_${scala.binary.version}</artifactId>

RE: Use pig load function in spark

2015-03-23 Thread Dai, Kevin
Hi, Yin But our data is a customized sequence file which can be read by our customized load function in Pig, and I want to use Spark to reuse these load functions to read the data and transform it into RDDs. Best Regards, Kevin. From: Yin Huai [mailto:yh...@databricks.com] Sent: March 24, 2015 11:53 To: Dai,

Re: Is it possible to use json4s 3.2.11 with Spark 1.3.0?

2015-03-23 Thread Alexey Zinoviev
Thanks Marcelo, this option solved the problem (I'm using 1.3.0), but it works only if I remove extends Logging from the object; with extends Logging it returns: Exception in thread main java.lang.LinkageError: loader constraint violation in interface itable initialization: when resolving method

Re: Is yarn-standalone mode deprecated?

2015-03-23 Thread Sandy Ryza
The mode is not deprecated, but the name yarn-standalone is now deprecated. It's now referred to as yarn-cluster. -Sandy On Mon, Mar 23, 2015 at 11:49 AM, nitinkak001 nitinkak...@gmail.com wrote: Is yarn-standalone mode deprecated in Spark now. The reason I am asking is because while I can

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-23 Thread Yiannis Gkoufas
Hi Yin, Yes, I have set spark.executor.memory to 8g and the worker memory to 16g without any success. I cannot figure out how to increase the number of mapPartitions tasks. Thanks a lot On 20 March 2015 at 18:44, Yin Huai yh...@databricks.com wrote: spark.sql.shuffle.partitions only control

Re: Spark error NoClassDefFoundError: org/apache/hadoop/mapred/InputSplit

2015-03-23 Thread Ted Yu
InputSplit is in hadoop-mapreduce-client-core jar Please check that the jar is in your classpath. Cheers On Mon, Mar 23, 2015 at 8:10 AM, , Roy rp...@njit.edu wrote: Hi, I am using CDH 5.3.2 packages installation through Cloudera Manager 5.3.2 I am trying to run one spark job with

Re: PySpark, ResultIterable and taking a list and saving it into different parquet files

2015-03-23 Thread chuwiey
In case anyone wants to learn about my solution for this: groupByKey is highly inefficient due to the swapping of elements between the different partitions as well as requiring enough mem in each worker to handle the elements for each group. So instead of using groupByKey, I ended up taking the
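
For anyone hitting the same wall, a small sketch of the usual substitution, shown in Scala with made-up data (PySpark exposes the same operations): reduceByKey and aggregateByKey combine values map-side before the shuffle, so far less data moves and no single group has to fit in one worker's memory.

    val events = sc.parallelize(Seq(("a", 1L), ("b", 2L), ("a", 3L)))

    // instead of events.groupByKey().mapValues(_.sum):
    val totals = events.reduceByKey(_ + _)

    // aggregateByKey when the result type differs from the value type, e.g. (sum, count) per key:
    val sumAndCount = events.aggregateByKey((0L, 0L))(
      (acc, v) => (acc._1 + v, acc._2 + 1L),
      (a, b)   => (a._1 + b._1, a._2 + b._2))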

Re: Is yarn-standalone mode deprecated?

2015-03-23 Thread Sandy Ryza
The former is deprecated. However, the latter is functionally equivalent to it. Both launch an app in what is now called yarn-cluster mode. Oozie now also has a native Spark action, though I'm not familiar on the specifics. -Sandy On Mon, Mar 23, 2015 at 1:01 PM, Nitin kak

Getting around Serializability issues for types not in my control

2015-03-23 Thread adelbertc
Hey all, I'd like to use the Scalaz library in some of my Spark jobs, but am running into issues where some stuff I use from Scalaz is not serializable. For instance, in Scalaz there is a trait /** In Scalaz */ trait Applicative[F[_]] { def apply2[A, B, C](fa: F[A], fb: F[B])(f: (A, B) => C):

Re: JDBC DF using DB2

2015-03-23 Thread Ted Yu
Quoting: "... is to modify compute_classpath.sh on all worker nodes to include your driver JARs." Please follow the above advice. Cheers On Mon, Mar 23, 2015 at 12:34 PM, Jack Arenas j...@ckarenas.com wrote: Hi Team, I’m trying to create a DF using jdbc as detailed here

Re: Spark per app logging

2015-03-23 Thread Udit Mehta
Yes each application can use its own log4j.properties but I am not sure how to configure log4j so that the driver and executor write to file. This is because if we set the spark.executor.extraJavaOptions it will read from a file and that is not what I need. How do I configure log4j from the app so

Re: Spark streaming alerting

2015-03-23 Thread Mohit Anchlia
I think I didn't explain myself properly :) What I meant to say was that generally spark worker runs on either on HDFS's data nodes or on Cassandra nodes, which typically is in a private network (protected). When a condition is matched it's difficult to send out the alerts directly from the worker

Spark-thriftserver Issue

2015-03-23 Thread Neil Dev
Hi, I am having an issue starting spark-thriftserver. I'm running Spark 1.3 with Hadoop 2.4.0. I would like to be able to change its port too, so I can have hive-thriftserver as well as spark-thriftserver running at the same time. Starting spark-thriftserver: sudo ./start-thriftserver.sh --master

Re: newbie quesiton - spark with mesos

2015-03-23 Thread Dean Wampler
That's a very old page, try this instead: http://spark.apache.org/docs/latest/running-on-mesos.html When you run your Spark job on Mesos, tasks will be started on the slave nodes as needed, since fine-grained mode is the default. For a job like your example, very few tasks will be needed.

Re: Getting around Serializability issues for types not in my control

2015-03-23 Thread Cody Koeninger
Have you tried instantiating the instance inside the closure, rather than outside of it? If that works, you may need to switch to use mapPartition / foreachPartition for efficiency reasons. On Mon, Mar 23, 2015 at 3:03 PM, Adelbert Chang adelbe...@gmail.com wrote: Is there no way to pull out
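
A generic sketch of that pattern, with a made-up stand-in class (Formatter) for whatever type refuses to serialize: construct it once per partition inside mapPartitions so it never has to leave the executor.

    import org.apache.spark.rdd.RDD

    class Formatter {                          // pretend this comes from a library and is not Serializable
      def format(s: String): String = s.toUpperCase
    }

    def formatAll(input: RDD[String]): RDD[String] =
      input.mapPartitions { iter =>
        val fmt = new Formatter()              // constructed on the executor, once per partition
        iter.map(fmt.format)
      }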

Re: Strange behavior with PySpark when using Join() and zip()

2015-03-23 Thread Sean Owen
I think this is a bad example since testData is not deterministic at all. I thought we had fixed this or similar examples in the past? As in https://github.com/apache/spark/pull/1250/files Hm, anyone see a reason that shouldn't be changed too? On Mon, Mar 23, 2015 at 7:00 PM, Ofer Mendelevitch

Re: Strange behavior with PySpark when using Join() and zip()

2015-03-23 Thread Ofer Mendelevitch
Thanks Sean, Sorting definitely solves it, but I was hoping it could be avoided :) In the documentation for Classification in ML-Lib for example, zip() is used to create labelsAndPredictions: - from pyspark.mllib.tree import RandomForest from pyspark.mllib.util import MLUtils # Load and

SparkEnv

2015-03-23 Thread Koert Kuipers
is it safe to access SparkEnv.get inside say mapPartitions? i need to get a Serializer (so SparkEnv.get.serializer) thanks
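
For what it's worth, a hedged sketch of that usage; SparkEnv is a developer-level API, so its stability across versions isn't guaranteed. The idea is to fetch the configured serializer inside the task and create a per-task SerializerInstance (instances are cheap and not thread-safe):

    import org.apache.spark.SparkEnv

    val records = sc.parallelize(1 to 100).map(i => ("key" + i, i))
    val serializedSizes = records.mapPartitions { iter =>
      val ser = SparkEnv.get.serializer.newInstance()    // one instance per task
      iter.map(rec => ser.serialize(rec).limit())         // e.g. measure each record's serialized size in bytes
    }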

Re: Getting around Serializability issues for types not in my control

2015-03-23 Thread Dean Wampler
Well, it's complaining about trait OptionInstances which is defined in Option.scala in the std package. Use scalap or javap on the scalaz library to find out which member of the trait is the problem, but since it says $$anon$1, I suspect it's the first value member, implicit val optionInstance,

JDBC DF using DB2

2015-03-23 Thread Jack Arenas
Hi Team, I’m trying to create a DF using jdbc as detailed here – I’m currently using DB2 v9.7.0.6 and I’ve tried to use the db2jcc.jar and db2jcc_license_cu.jar combo, and while it works in --master local using the command below, I get some strange behavior in --master yarn-client. Here is

Re: Use pig load function in spark

2015-03-23 Thread Denny Lee
You may be able to utilize Spork (Pig on Apache Spark) as a mechanism to do this: https://github.com/sigmoidanalytics/spork On Mon, Mar 23, 2015 at 2:29 AM Dai, Kevin yun...@ebay.com wrote: Hi, all Can spark use pig’s load function to load data? Best Regards, Kevin.

Re: Getting around Serializability issues for types not in my control

2015-03-23 Thread Adelbert Chang
Is there no way to pull out the bits of the instance I want before I send it through the closure for aggregate? I did try pulling things out, along the lines of def foo[G[_], B](blah: Blah)(implicit G: Applicative[G]) = { val lift: B => G[RDD[B]] = b => G.point(sparkContext.parallelize(List(b)))

objectFile uses only java serializer?

2015-03-23 Thread Koert Kuipers
in the comments on SparkContext.objectFile it says: "It will also be pretty slow if you use the default serializer (Java serialization)". this suggests the spark.serializer is used, which means i can switch to the much faster kryo serializer. however when i look at the code it uses

Re: Converting SparkSQL query to Scala query

2015-03-23 Thread Dean Wampler
There isn't any automated way. Note that as the DataFrame implementation improves, it will probably do a better job with query optimization than hand-rolled Scala code. I don't know if that's true yet, though. For now, there are a few examples at the beginning of the DataFrame scaladocs

Re: SchemaRDD/DataFrame result partitioned according to the underlying datasource partitions

2015-03-23 Thread Michael Armbrust
There is not an interface to this at this time, and in general I'm hesitant to open up interfaces where the user could make a mistake where they think something is going to improve performance but will actually impact correctness. Since, as you say, we are picking the partitioner automatically in

Re: How to check that a dataset is sorted after it has been written out?

2015-03-23 Thread Akhil Das
One approach would be to repartition the whole data into 1 (costly operation though, but will give you a single file). Also, You could try using zipWithIndex before writing it out. Thanks Best Regards On Sat, Mar 21, 2015 at 4:11 AM, Michael Albert m_albert...@yahoo.com.invalid wrote:

Spark Sql with python udf fail

2015-03-23 Thread lonely Feb
Hi all, I tried to transfer some hive jobs into spark-sql. When i ran a sql job with python udf i got a exception: java.lang.ArrayIndexOutOfBoundsException: 9 at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(Row.scala:142) at

Re: Spark UI tunneling

2015-03-23 Thread Sergey Gerasimov
Akhil, that's what I did. The problem is that probably the web server tried to forward my request to another address accessible locally only. On 23 March 2015, at 11:12, Akhil Das ak...@sigmoidanalytics.com wrote: Did you try ssh -L 4040:127.0.0.1:4040 user@host Thanks Best Regards

Re: Data/File structure Validation

2015-03-23 Thread Ahmed Nawar
Dear Taotao, Yes, I tried sparkCSV. Thanks, Nawwar On Mon, Mar 23, 2015 at 12:20 PM, Taotao.Li taotao...@datayes.com wrote: can it load successfully if the format is invalid? -- From: Ahmed Nawar ahmed.na...@gmail.com To: user@spark.apache.org Sent:

Re: Data/File structure Validation

2015-03-23 Thread Ahmed Nawar
Dear Raunak, The source system provided logs with some errors. I need to make sure each row is in the correct format (the number of columns/attributes and the data types are correct) and move incorrect rows to a separate list. Of course I can write my own logic, but I need to make sure there is no direct way.

Re: Buffering for Socket streams

2015-03-23 Thread Akhil Das
You can try playing with spark.streaming.blockInterval so that it won't consume a lot of data; the default value is 200ms. Thanks Best Regards On Fri, Mar 20, 2015 at 8:49 PM, jamborta jambo...@gmail.com wrote: Hi all, We are designing a workflow where we try to stream local files to a Socket
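
For reference, a minimal sketch of where that knob goes; in Spark 1.x the value is read as milliseconds, and the app name and batch interval below are arbitrary:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("socket-stream")
      .set("spark.streaming.blockInterval", "1000")   // default 200 (ms), i.e. 10 blocks/sec; 1000 gives 1 block/sec
    val ssc = new StreamingContext(conf, Seconds(10))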

Cassandra time series + Spark

2015-03-23 Thread Rumph, Frens Jan
Hi, I'm working on a system which has to deal with time series data. I've been happy using Cassandra for time series and Spark looks promising as a computational platform. I consider chunking time series in Cassandra necessary, e.g. by 3 weeks as kairosdb does it. This allows an 8 byte chunk

log files of failed task

2015-03-23 Thread sergunok
Hi, I executed a task on Spark in YARN and it failed. I see just an executor lost message from YARNClientScheduler, no further details. (I read this error can be connected to the spark.yarn.executor.memoryOverhead setting and have already played with this param.) How to go more deeply into details in log files

Re: Data/File structure Validation

2015-03-23 Thread Taotao.Li
can it load successfully if the format is invalid? - Original Message - From: Ahmed Nawar ahmed.na...@gmail.com To: user@spark.apache.org Sent: Monday, March 23, 2015 4:48:54 PM Subject: Data/File structure Validation Dears, Is there any way to validate the CSV, Json ... Files while loading to

Re: Spark Sql with python udf fail

2015-03-23 Thread Cheng Lian
Could you elaborate on the UDF code? On 3/23/15 3:43 PM, lonely Feb wrote: Hi all, I tried to transfer some hive jobs into spark-sql. When i ran a sql job with python udf i got a exception: java.lang.ArrayIndexOutOfBoundsException: 9 at

Re: Spark streaming alerting

2015-03-23 Thread Jeffrey Jedele
What exactly do you mean by alerts? Something specific to your data or general events of the Spark cluster? For the first, something like Akhil suggested should work. For the latter, I would suggest having a log consolidation system like logstash in place and using this to generate alerts. Regards, Jeff

Data/File structure Validation

2015-03-23 Thread Ahmed Nawar
Dears, Is there any way to validate the CSV, Json ... files while loading to a DataFrame? I need to ignore corrupted rows (rows not matching the schema). Thanks, Ahmed Nawwar

Re: Spark streaming alerting

2015-03-23 Thread Akhil Das
What do you mean you can't send it directly from spark workers? Here's a simple approach which you could do: val data = ssc.textFileStream("sigmoid/") val dist = data.filter(_.contains("ERROR")).foreachRDD(rdd => alert("Errors : " + rdd.count())) And the alert() function could be anything

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-23 Thread Night Wolf
Was a solution ever found for this? Trying to run some test cases with sbt test which use Spark SQL, and in the Spark 1.3.0 release with Scala 2.11.6 I get this error. Setting fork := true in sbt seems to work, but it's a less than ideal workaround. On Tue, Mar 17, 2015 at 9:37 PM, Eric Charles

Re: Spark Sql with python udf fail

2015-03-23 Thread lonely Feb
ok i'll try asap 2015-03-23 17:00 GMT+08:00 Cheng Lian lian.cs@gmail.com: I suspect there is a malformed row in your input dataset. Could you try something like this to confirm: sql(SELECT * FROM your-table).foreach(println) If there does exist a malformed line, you should see similar

Re: Spark Sql with python udf fail

2015-03-23 Thread lonely Feb
sql("SELECT * FROM your-table").foreach(println) can be executed successfully, so the problem may still be in the UDF code. How can I print the line with the ArrayIndexOutOfBoundsException in catalyst? 2015-03-23 17:04 GMT+08:00 lonely Feb lonely8...@gmail.com: ok i'll try asap 2015-03-23 17:00

Re: SocketTimeout only when launching lots of executors

2015-03-23 Thread Akhil Das
It seems your driver is getting flooded by that many executors and hence it gets a timeout. There are some configuration options like spark.akka.timeout etc.; you could try playing with those. More information is available here: http://spark.apache.org/docs/latest/configuration.html Thanks

Re: netlib-java cannot load native lib in Windows when using spark-submit

2015-03-23 Thread Xi Shen
I did not build my own Spark. I got the binary version online. If it can load the native libs from the IDE, it should also be able to load them when running with --master local. On Mon, 23 Mar 2015 07:15 Burak Yavuz brk...@gmail.com wrote: Did you build Spark with: -Pnetlib-lgpl? Ref:

Re: Spark Sql with python udf fail

2015-03-23 Thread Cheng Lian
I suspect there is a malformed row in your input dataset. Could you try something like this to confirm: sql("SELECT * FROM your-table").foreach(println) If there does exist a malformed line, you should see a similar exception. And you can catch it with the help of the output. Notice that the

Spark SQL udf(ScalaUdf) is very slow

2015-03-23 Thread zzcclp
My test env: 1. Spark version is 1.3.0; 2. 3 nodes, each 80G/20C; 3. read 250G parquet files from hdfs. Test case: 1. register floor func with command: sqlContext.udf.register("floor", (ts: Int) => ts - ts % 300), then run with sql select chan, floor(ts) as tt, sum(size) from qlogbase3 group by chan,

Re: Spark streaming alerting

2015-03-23 Thread Khanderao Kand Gmail
Akhil, You are right in your answer to what Mohit wrote. However, what Mohit seems to be alluding to but did not write properly might be different. Mohit, You are wrong in saying that streaming generally works with HDFS and Cassandra. Streaming typically works with a streaming or queuing source like Kafka,

Re: Spark UI tunneling

2015-03-23 Thread Akhil Das
Oh in that case you could try adding the hostname in your /etc/hosts under your localhost. Also make sure there is a request going to another host by inspecting the network calls. Thanks Best Regards On Mon, Mar 23, 2015 at 1:55 PM, Sergey Gerasimov ser...@gmail.com

Re: log files of failed task

2015-03-23 Thread Emre Sevinc
Hello Sergun, Generally you can use yarn application -list to see the applicationIDs of applications and then you can see the logs of finished applications using: yarn logs -applicationId applicationID Hope this helps. -- Emre Sevinç http://www.bigindustries.be/ On Mon, Mar 23, 2015

Re: How Does aggregate work

2015-03-23 Thread Paweł Szulc
It is actually the number of cores. If your processor has hyperthreading then it will be more (the number of processors your OS sees). On Sun, 22 Mar 2015, 4:51 PM, Ted Yu yuzhih...@gmail.com wrote: I assume spark.default.parallelism is 4 in the VM Ashish was using. Cheers

Re: How to handle under-performing nodes in the cluster

2015-03-23 Thread Akhil Das
It seems that node is not getting allocated with enough tasks, try increasing your level of parallelism or do a manual repartition so that everyone gets even tasks to operate on. Thanks Best Regards On Fri, Mar 20, 2015 at 8:05 PM, Yiannis Gkoufas johngou...@gmail.com wrote: Hi all, I have 6

Re: updateStateByKey performance API

2015-03-23 Thread Andre Schumacher
Hi Nikos, We experienced something similar in our setting, where the Spark app was supposed to write the final state changes to a Redis instance. Over time the delay caused by re-writing the entire dataset in each iteration exceeded the Spark streaming batch size. In our case the solution was

Re: Spark UI tunneling

2015-03-23 Thread Akhil Das
Did you try ssh -L 4040:127.0.0.1:4040 user@host Thanks Best Regards On Mon, Mar 23, 2015 at 1:12 PM, sergunok ser...@gmail.com wrote: Is it a way to tunnel Spark UI? I tried to tunnel client-node:4040 but my browser was redirected from localhost to some cluster locally visible domain

Use pig load function in spark

2015-03-23 Thread Dai, Kevin
Hi, all Can spark use pig's load function to load data? Best Regards, Kevin.

Re: spark disk-to-disk

2015-03-23 Thread Reynold Xin
Maybe implement a very simple function that uses the Hadoop API to read in based on file names (i.e. parts)? On Mon, Mar 23, 2015 at 10:55 AM, Koert Kuipers ko...@tresata.com wrote: there is a way to reinstate the partitioner, but that requires sc.objectFile to read exactly what i wrote, which

Re: Getting around Serializability issues for types not in my control

2015-03-23 Thread Adelbert Chang
Instantiating the instance? The actual instance it's complaining about is: https://github.com/scalaz/scalaz/blob/16838556c9309225013f917e577072476f46dc14/core/src/main/scala/scalaz/std/Option.scala#L10-11 The specific import where it's picking up the instance is:

Is it possible to use json4s 3.2.11 with Spark 1.3.0?

2015-03-23 Thread Alexey Zinoviev
Spark has a dependency on json4s 3.2.10, but this version has several bugs and I need to use 3.2.11. I added json4s-native 3.2.11 dependency to build.sbt and everything compiled fine. But when I spark-submit my JAR it provides me with 3.2.10. build.sbt import sbt.Keys._ name := sparkapp

Re: Spark 1.3 Dynamic Allocation - Requesting 0 new executor(s) because tasks are backlogged

2015-03-23 Thread Marcelo Vanzin
On Mon, Mar 23, 2015 at 2:15 PM, Manoj Samel manojsamelt...@gmail.com wrote: Found the issue above error - the setting for spark_shuffle was incomplete. Now it is able to ask and get additional executors. The issue is once they are released, it is not able to proceed with next query. That

Re: How to use DataFrame with MySQL

2015-03-23 Thread Rishi Yadav
for me, it's only working if I set --driver-class-path to mysql library. On Sun, Mar 22, 2015 at 11:29 PM, gavin zhang gavin@gmail.com wrote: OK,I found what the problem is: It couldn't work with mysql-connector-5.0.8. I updated the connector version to 5.1.34 and it worked. -- View

Shuffle Spill Memory and Shuffle Spill Disk

2015-03-23 Thread Bijay Pathak
Hello, I am running TeraSort https://github.com/ehiggs/spark-terasort on 100GB of data. The final metrics I am getting on Shuffle Spill are: Shuffle Spill(Memory): 122.5 GB Shuffle Spill(Disk): 3.4 GB What's the difference and relation between these two metrics? Does these mean 122.5 GB was

hadoop input/output format advanced control

2015-03-23 Thread Koert Kuipers
currently it's pretty hard to control the Hadoop Input/Output formats used in Spark. The convention seems to be to add extra parameters to all methods and then somewhere deep inside the code (for example in PairRDDFunctions.saveAsHadoopFile) all these parameters get translated into settings on the
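
The existing, if clunky, escape hatch is to build your own Hadoop Configuration and hand it to the newAPI read/write methods rather than relying on what Spark assembles internally; a hedged sketch with a hypothetical path and a standard Hadoop split-size key:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val hadoopConf = new Configuration(sc.hadoopConfiguration)
    hadoopConf.set("mapreduce.input.fileinputformat.split.maxsize", (64 * 1024 * 1024).toString)

    val lines = sc.newAPIHadoopFile(
      "hdfs:///data/input",                 // hypothetical path
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      hadoopConf)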

Re: Spark 1.3 Dynamic Allocation - Requesting 0 new executor(s) because tasks are backlogged

2015-03-23 Thread Manoj Samel
Log shows stack traces that seem to match the assert in JIRA so it seems I am hitting the issue. Thanks for the heads up ... 15/03/23 20:29:50 ERROR actor.OneForOneStrategy: assertion failed: Allocator killed more executors than are allocated! java.lang.AssertionError: assertion failed: Allocator

Re: Using a different spark jars than the one on the cluster

2015-03-23 Thread Denny Lee
+1 - I currently am doing what Marcelo is suggesting as I have a CDH 5.2 cluster (with Spark 1.1) and I'm also running Spark 1.3.0+ side-by-side in my cluster. On Wed, Mar 18, 2015 at 1:23 PM Marcelo Vanzin van...@cloudera.com wrote: Since you're using YARN, you should be able to download a

Re: Spark 1.3 Dynamic Allocation - Requesting 0 new executor(s) because tasks are backlogged

2015-03-23 Thread Manoj Samel
Found the issue above error - the setting for spark_shuffle was incomplete. Now it is able to ask and get additional executors. The issue is once they are released, it is not able to proceed with next query. The environment is CDH 5.3.2 (Hadoop 2.5) with Kerberos Spark 1.3 After idle time, the

SchemaRDD/DataFrame result partitioned according to the underlying datasource partitions

2015-03-23 Thread Stephen Boesch
Is there a way to take advantage of the underlying datasource partitions when generating a DataFrame/SchemaRDD via catalyst? It seems from the sql module that the only options are RangePartitioner and HashPartitioner - and further that those are selected automatically by the code . It was not

Re: Why doesn't the --conf parameter work in yarn-cluster mode (but works in yarn-client and local)?

2015-03-23 Thread Sandy Ryza
Hi Emre, The --conf property is meant to work with yarn-cluster mode. System.getProperty(key) isn't guaranteed to work there, but new SparkConf().get(key) should. Does it not? -Sandy On Mon, Mar 23, 2015 at 8:39 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, According to Spark Documentation at
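
A minimal sketch of the SparkConf route, assuming the job was submitted with a made-up key such as --conf spark.myapp.inputPath=/data/in (a spark.-prefixed key is the safe choice for propagation):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()                                          // picks up properties set by spark-submit
    val inputPath = conf.get("spark.myapp.inputPath", "/tmp/fallback")  // hypothetical key and default
    // avoid System.getProperty here: in yarn-cluster mode the driver runs in a different JVM on the cluster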

Parquet file + increase read parallelism

2015-03-23 Thread SamyaMaiti
Hi All, Suppose I have a parquet file of 100 MB in HDFS and my HDFS block size is 64MB, so I have 2 blocks of data. When I do sqlContext.parquetFile(path) followed by an action, two tasks are started on two partitions. My intent is to read these 2 blocks into more partitions to fully utilize my cluster
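
One hedged workaround while the file only yields two splits, assuming the Spark 1.3 DataFrame API and a made-up path: read it, then repartition to spread the rows across more tasks for the downstream stages, at the cost of a shuffle.

    val df = sqlContext.parquetFile("hdfs:///data/events.parquet")   // hypothetical path
    val wider = df.repartition(32)                                    // roughly 2-3x the total cores in the cluster
    wider.registerTempTable("events")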

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-23 Thread Martin Goodson
Have you tried to repartition() your original data to make more partitions before you aggregate? -- Martin Goodson | VP Data Science (0)20 3397 1240 On Mon, Mar 23, 2015 at 4:12 PM, Yiannis Gkoufas johngou...@gmail.com wrote: Hi Yin, Yes, I have set

Re: spark disk-to-disk

2015-03-23 Thread Koert Kuipers
i just realized the major limitation is that i lose partitioning info... On Mon, Mar 23, 2015 at 1:34 AM, Reynold Xin r...@databricks.com wrote: On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers ko...@tresata.com wrote: so finally i can resort to: rdd.saveAsObjectFile(...) sc.objectFile(...)

Re: Write to Parquet File in Python

2015-03-23 Thread chuwiey
Hey Akriti23, pyspark gives you a saveAsParquetFile() api, to save your rdd as parquet. You will however, need to infer the schema or describe it manually before you can do so. Here are some docs about that (v1.2.1, you can search for the others, they're relatively similar 1.1 and up):

newbie quesiton - spark with mesos

2015-03-23 Thread Anirudha Jadhav
i have a mesos cluster, which i deploy spark to by using instructions on http://spark.apache.org/docs/0.7.2/running-on-mesos.html after that the spark shell starts up fine. then i try the following on the shell: val data = 1 to 1 val distData = sc.parallelize(data) distData.filter(_

Re: version conflict common-net

2015-03-23 Thread Jacob Abraham
Hi Sean, Thanks a ton for you reply. The particular situation I have is case (3) that you have mentioned. The class that I am using from commons-net is FTPClient(). This class is present in both the 2.2 version and the 3.3 version. However, in the 3.3 version there are two additional methods

Re: Strange behavior with PySpark when using Join() and zip()

2015-03-23 Thread Sean Owen
I think the explanation is that the join does not guarantee any order, since it causes a shuffle in general, and it is computed twice in the first example, resulting in a difference for d1 and d2. You can persist() the result of the join and in practice I believe you'd find it behaves as
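
A small sketch of the persist() suggestion with made-up data: materialize the join once, so both downstream RDDs are derived from the same computed partitions instead of re-running the non-deterministically ordered shuffle.

    import org.apache.spark.storage.StorageLevel

    val labels      = sc.parallelize(Seq((1L, 0.0), (2L, 1.0), (3L, 0.0)))
    val predictions = sc.parallelize(Seq((1L, 0.1), (2L, 0.9), (3L, 0.2)))

    val joined = labels.join(predictions).persist(StorageLevel.MEMORY_AND_DISK)
    val labelCol = joined.map { case (_, (label, _)) => label }
    val predCol  = joined.map { case (_, (_, pred))  => pred }
    // labelCol.zip(predCol) now pairs elements consistently, since both sides read the same cached partitions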

Re: spark disk-to-disk

2015-03-23 Thread Koert Kuipers
there is a way to reinstate the partitioner, but that requires sc.objectFile to read exactly what i wrote, which means sc.objectFile should never split files on reading (a feature of hadoop file inputformat that gets in the way here). On Mon, Mar 23, 2015 at 1:39 PM, Koert Kuipers

Re: version conflict common-net

2015-03-23 Thread Sean Owen
I think it's spark.yarn.user.classpath.first in 1.2, and spark.{driver,executor}.extraClassPath in 1.3. Obviously that's for if you are using YARN, in the first instance. On Mon, Mar 23, 2015 at 5:41 PM, Jacob Abraham abe.jac...@gmail.com wrote: Hi Sean, Thanks a ton for you reply. The

Re: How to check that a dataset is sorted after it has been written out?

2015-03-23 Thread Michael Albert
Thanks for the information! (to all who responded) The code below *seems* to work. Any hidden gotchas that anyone sees? And still, in terasort, how did they check that the data was actually sorted? :-) -Mike class MyInputFormat[T] extends parquet.hadoop.ParquetInputFormat[T] { override def

Converting SparkSQL query to Scala query

2015-03-23 Thread nishitd
I have a complex SparkSQL query of the nature select a.a, b.b, c.c from a, b, c where a.x = b.x and b.y = c.y. How do I convert this efficiently into a Scala query of a.join(b,..,..) and so on? Can anyone help me with this? If my question needs more clarification, please let me know.
