Re: Spark ec2 launch problem

2015-08-24 Thread Robin East
spark-ec2 is the way to go, however you may need to debug connectivity issues. For example, do you know that the servers were correctly set up in AWS, and can you access each node using ssh? If not, then you need to work out why (it’s not a Spark issue). If yes, then you will need to work out why ssh

How to evaluate custom UDF over window

2015-08-24 Thread xander92
The ultimate aim of my program is to be able to wrap an arbitrary Scala function (mostly these will be statistics / customized rolling window metrics) in a UDF and evaluate them on DataFrames using the window functionality. So my main question is how do I express that a UDF takes a Frame of rows from a

Re: Loading already existing tables in spark shell

2015-08-24 Thread Ishwardeep Singh
Hi Jeetendra, I faced this issue. I did not specify the database where this table exists. Please set the database with the USE <database> command before executing the query. Regards, Ishwardeep From: Jeetendra Gangele gangele...@gmail.com Sent: Monday,

Performance - Python streaming v/s Scala streaming

2015-08-24 Thread utk.pat
I am new to Spark Streaming. I was running the kafka_wordcount example with a local Kafka and Spark instance. It was very easy to set this up and get going :) I tried running both the Scala and Python versions of the word count example. The Python version seems to be extremely slow. Sometimes it has

Re: Joining using multimap or array

2015-08-24 Thread Ilya Karpov
Thanks, but I think this is not a case of multiple Spark contexts (nevertheless I tried your suggestion - it didn’t work). The problem is joining two datasets using array item values: attribute.value in my case. Does anyone have ideas? On 24 Aug 2015, at 15:01, satish chandra j

Re: Joining using multimap or array

2015-08-24 Thread satish chandra j
Hi, if your join logic is correct, it seems to be a similar issue to one I faced recently. Can you try *SparkContext(conf).set(spark.driver.allowMultipleContexts,true)*? Regards, Satish Chandra On Mon, Aug 24, 2015 at 2:51 PM, Ilya Karpov i.kar...@cleverdata.ru wrote: Hi, guys I'm confused

Difficulties developing a Specs2 matcher for Spark Streaming

2015-08-24 Thread Juan Rodríguez Hortalá
Hi, I've had some troubles developing a Specs2 matcher that checks that a predicate holds for all the elements of an RDD, and using it for testing a simple Spark Streaming program. I've finally been able to get a code that works, you can see it in

RE: Spark ec2 launch problem

2015-08-24 Thread Garry Chen
So what is the best way to deploy a Spark cluster in an EC2 environment? Any suggestions? Garry From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Friday, August 21, 2015 4:27 PM To: Garry Chen g...@cornell.edu Cc: user@spark.apache.org Subject: Re: Spark ec2 launch problem It may happen that

Re: Joining using multimap or array

2015-08-24 Thread Hemant Bhanawat
In your example, a.attributes.name is a list and not a string. Run this to find out: a.select($"a.attributes.name").show() On Mon, Aug 24, 2015 at 2:51 PM, Ilya Karpov i.kar...@cleverdata.ru wrote: Hi, guys I'm confused about joining columns in SparkSQL and need your advice. I want to

RE: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Cheng, Hao
The first job is to infer the json schema, and the second one is what you mean of the query. You can provide the schema while loading the json file, like below: sqlContext.read.schema(xxx).json(“…”)? Hao From: Jeff Zhang [mailto:zjf...@gmail.com] Sent: Monday, August 24, 2015 6:20 PM To:
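A minimal sketch of what Hao suggests, assuming the sqlContext available in spark-shell and the standard people.json example bundled with Spark; the field names and types below are assumptions:

    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

    // Supplying the schema up front avoids the extra job Spark runs to infer it from the data.
    val schema = StructType(Seq(
      StructField("age", LongType, nullable = true),
      StructField("name", StringType, nullable = true)))

    val df = sqlContext.read.schema(schema).json("examples/src/main/resources/people.json")
    df.show()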

Re: How to set environment of worker applications

2015-08-24 Thread Raghavendra Pandey
System properties and environment variables are two different things.. One can use spark.executor.extraJavaOptions to pass system properties and spark-env.sh to pass environment variables. -raghav On Mon, Aug 24, 2015 at 1:00 PM, Hemant Bhanawat hemant9...@gmail.com wrote: That's surprising.

RE: Loading already existing tables in spark shell

2015-08-24 Thread Cheng, Hao
And be sure hive-site.xml is on the classpath or under $SPARK_HOME/conf. Hao From: Ishwardeep Singh [mailto:ishwardeep.si...@impetus.co.in] Sent: Monday, August 24, 2015 8:57 PM To: user Subject: Re: Loading already existing tables in spark shell Hi Jeetendra, I faced

DataFrame/JDBC very slow performance

2015-08-24 Thread Dhaval Patel
I am trying to access a mid-size Teradata table (~100 million rows) via JDBC in standalone mode on a single node (local[*]). When I tried with a BIG table (5B records), no results were returned upon completion of the query. I am using Spark 1.4.1, and it is set up on a very powerful machine (2 CPUs, 24 cores,

Re: Got wrong md5sum for boto

2015-08-24 Thread Justin Pihony
Additional info... If I use an online md5sum check then it matches... So, it's either Windows or Python (using 2.7.10). On Mon, Aug 24, 2015 at 11:54 AM, Justin Pihony justin.pih...@gmail.com wrote: When running the spark_ec2.py script, I'm getting a wrong md5sum. I've now seen this on two

Re: How to evaluate custom UDF over window

2015-08-24 Thread Yin Huai
For now, user-defined window functions are not supported. We will add them in the future. On Mon, Aug 24, 2015 at 6:26 AM, xander92 alexander.fra...@ompnt.com wrote: The ultimate aim of my program is to be able to wrap an arbitrary Scala function (mostly will be statistics / customized rolling window

Re: Got wrong md5sum for boto

2015-08-24 Thread Justin Pihony
I found this solution: https://stackoverflow.com/questions/3390484/python-hashlib-md5-differs-between-linux-windows Does anybody see a reason why I shouldn't put in a PR to make this change? FROM: with open(tgz_file_path) as tar: TO: with open(tgz_file_path, 'rb') as tar: On Mon, Aug 24, 2015 at

Re: Drop table and Hive warehouse

2015-08-24 Thread Michael Armbrust
That's not the expected behavior. What version of Spark? On Mon, Aug 24, 2015 at 1:32 AM, Kevin Jung itsjb.j...@samsung.com wrote: When I store DataFrame as table with command saveAsTable and then execute DROP TABLE in SparkSQL, it doesn't actually delete files in hive warehouse. The table

Got wrong md5sum for boto

2015-08-24 Thread Justin Pihony
When running the spark_ec2.py script, I'm getting a wrong md5sum. I've now seen this on two different machines. I am running on Windows, but I would imagine that shouldn't affect the md5. Is this a boto problem, a Python problem, or a Spark problem? -- View this message in context:

Re: [Spark Streaming on Mesos (good practices)]

2015-08-24 Thread Aram Mkrtchyan
Here is the answer to my question, in case somebody needs it: running Spark in Standalone mode or coarse-grained Mesos mode leads to better task launch times than the fine-grained Mesos mode. The resource is http://spark.apache.org/docs/latest/streaming-programming-guide.html On Mon, Aug 24, 2015

Re: Unable to catch SparkContext methods exceptions

2015-08-24 Thread Burak Yavuz
textFile is a lazy operation. It doesn't evaluate until you call an action on it, such as .count(). Therefore, you won't catch the exception there. Best, Burak On Mon, Aug 24, 2015 at 9:09 AM, Roberto Coluccio roberto.coluc...@gmail.com wrote: Hello folks, I'm experiencing an unexpected

Re: Spark Direct Streaming With ZK Updates

2015-08-24 Thread suchenzang
Forgot to include the PR I was referencing: https://github.com/apache/spark/pull/4805/ -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Direct-Streaming-With-ZK-Updates-tp24423p24424.html Sent from the Apache Spark User List mailing list archive at

Re: Spark Direct Streaming With ZK Updates

2015-08-24 Thread Cody Koeninger
It doesn't matter if shuffling occurs. Just update ZK from the driver, inside the foreachRDD, after all your dynamodb updates are done. Since you're just doing it for monitoring purposes, that should be fine. On Mon, Aug 24, 2015 at 12:11 PM, suchenzang suchenz...@gmail.com wrote: Forgot to

Re: Kafka Spark Partition Mapping

2015-08-24 Thread Syed, Nehal (Contractor)
Dear Cody, Thanks for your response. I am trying to do decoration, which means that when a message comes from Kafka (partitioned by key) into Spark, I want to add more fields/data to it. How do people normally do it in Spark? If it were you, how would you decorate a message without hitting

Re: Unable to catch SparkContext methods exceptions

2015-08-24 Thread Burak Yavuz
The laziness is hard to deal with in these situations. I would suggest trying to handle expected cases (FileNotFound, etc.) using other methods before even starting a Spark job. If you really want to try/catch a specific portion of a Spark job, one way is to just follow it with an action. You can
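A small sketch of that suggestion, assuming the sc from spark-shell and a made-up path; the exception type shown is the usual one for a missing input path but is an assumption here:

    import org.apache.hadoop.mapred.InvalidInputException

    try {
      // textFile is lazy, so a missing path only fails once an action such as count() runs.
      val lines = sc.textFile("hdfs:///data/may-not-exist.txt")
      lines.count()
    } catch {
      case e: InvalidInputException => println(s"Input problem: ${e.getMessage}")
    }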

RE: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-24 Thread Sereday, Scott
Can you please remove me from this distribution list? (Filling up my inbox too fast) From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Monday, August 24, 2015 2:13 PM To: Philip Weaver philip.wea...@gmail.com Cc: Jerrick Hoang jerrickho...@gmail.com; Raghavendra Pandey

Local Spark talking to remote HDFS?

2015-08-24 Thread Dino Fancellu
I have a file in HDFS inside my HortonWorks HDP 2.3_1 VirtualBox VM. If I go into the guest spark-shell and refer to the file thus, it works fine: val words=sc.textFile("hdfs:///tmp/people.txt") words.count However, if I try to access it from a local Spark app on my Windows host, it doesn't

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-24 Thread Jerrick Hoang
@Michael: would listStatus calls read the actual parquet footers within the folders? On Mon, Aug 24, 2015 at 11:36 AM, Sereday, Scott scott.sere...@nielsen.com wrote: Can you please remove me from this distribution list? (Filling up my inbox too fast) *From:* Michael Armbrust

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-24 Thread Michael Armbrust
No, starting with Spark 1.5 we should by default only be reading the footers on the executor side (that is unless schema merging has been explicitly turned on). On Mon, Aug 24, 2015 at 12:20 PM, Jerrick Hoang jerrickho...@gmail.com wrote: @Michael: would listStatus calls read the actual parquet

Re: Memory-efficient successive calls to repartition()

2015-08-24 Thread Alexis Gillain
Hi Aurelien, The first code should create a new RDD in memory at each iteration (check the webui). The second code will unpersist the RDD, but that's not the main problem. I think you have trouble due to long lineage, as .cache() keeps track of lineage for recovery. You should have a look at
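A rough sketch of the checkpointing idea being pointed at; the checkpoint directory, interval and transformations below are placeholders, not the original code:

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")

    var rdd = sc.parallelize(1 to 1000000)
    for (i <- 1 to 50) {
      rdd = rdd.map(_ + 1).repartition(8)
      if (i % 10 == 0) {
        rdd.checkpoint() // truncates the lineage kept for recovery
        rdd.count()      // an action is needed to actually materialize the checkpoint
      }
    }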

[Spark Streaming on Mesos (good practices)]

2015-08-24 Thread Aram Mkrtchyan
Which are the best practices to submit a Spark Streaming application on Mesos? I would like to know about the scheduler mode. Is `coarse-grained` mode the right solution? Thanks

Drop table and Hive warehouse

2015-08-24 Thread Kevin Jung
When I store a DataFrame as a table with the command saveAsTable and then execute DROP TABLE in SparkSQL, it doesn't actually delete the files in the Hive warehouse. The table disappears from the table list but the data files are still alive. Because of this, I can't saveAsTable with the same name before dropping

DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Jeff Zhang
It's weird to me that the simple show function costs 2 Spark jobs. DataFrame#explain shows it is a very simple operation, so I'm not sure why it needs 2 jobs. == Parsed Logical Plan == Relation[age#0L,name#1] JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json] ==

Determinant of Matrix

2015-08-24 Thread Naveen
Hi, Is there any function to find the determinant of a mllib.linalg.Matrix (a covariance matrix) using Spark? Regards, Naveen
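No reply is captured in this digest. One common workaround (an assumption, not an MLlib API) is to convert the local matrix to Breeze, which ships with Spark, and use its det:

    import breeze.linalg.{det, DenseMatrix => BDM}
    import org.apache.spark.mllib.linalg.Matrix

    // mllib.linalg matrices expose their values column-major via toArray, which is what Breeze expects.
    def determinant(m: Matrix): Double = {
      val bm = new BDM[Double](m.numRows, m.numCols, m.toArray)
      det(bm)
    }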

How to remove worker node but let it finish first?

2015-08-24 Thread Romi Kuntsman
Hi, I have a Spark standalone cluster with 100s of applications per day, and it changes size (more or fewer workers) at various hours. The driver runs on a separate machine outside the Spark cluster. When a job is running and its worker is killed (because at that hour the number of workers is

Re: How to set environment of worker applications

2015-08-24 Thread Hemant Bhanawat
That's surprising. Passing the environment variables using spark.executor.extraJavaOptions=-Dmyenvvar=xxx to the executor and then fetching them using System.getProperty(myenvvar) has worked for me. What is the error that you guys got? On Mon, Aug 24, 2015 at 12:10 AM, Sathish Kumaran Vairavelu
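A sketch of the mechanism described here; the property name myenvvar and its value are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Driver side: pass a JVM system property to every executor.
    val conf = new SparkConf()
      .setAppName("sysprop-example")
      .set("spark.executor.extraJavaOptions", "-Dmyenvvar=xxx")
    val sc = new SparkContext(conf)

    // Executor side: read the property back inside a task.
    sc.parallelize(1 to 3).foreach { _ =>
      println(System.getProperty("myenvvar"))
    }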

Joining using multimap or array

2015-08-24 Thread Ilya Karpov
Hi guys, I'm confused about joining columns in SparkSQL and need your advice. I want to join 2 datasets of profiles. Each profile has a name and an array of attributes (age, gender, email etc). There can be multiple instances of an attribute with the same name, e.g. a profile has 2 emails - so 2 attributes

Re: Transformation not happening for reduceByKey or GroupByKey

2015-08-24 Thread satish chandra j
Hi All, for users who are following the mail chain of this issue, please find the fix info and the respective solution below: *reduceByKey: Non working snippet* import org.apache.spark.Context import org.apache.spark.Context._ import org.apache.spark.SparkConf val conf = new SparkConf() val sc = new
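The actual fix is truncated above; for reference, a minimal self-contained reduceByKey example (not the poster's exact snippet) would look like this:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("reduceByKey-example")
    val sc = new SparkContext(conf)

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val summed = pairs.reduceByKey(_ + _)
    summed.collect().foreach(println) // e.g. (a,4), (b,2)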

History server is not receiving any event

2015-08-24 Thread b.bhavesh
Hi, I am working on a streaming application. I tried to configure the history server to persist the events of the application in the Hadoop file system (HDFS). However, it is not logging any events. I am running Apache Spark 1.4.1 (pyspark) under Ubuntu 14.04 with three nodes. Here is my configuration: File -

Array Out OF Bound Exception

2015-08-24 Thread SAHA, DEBOBROTA
Hi, I am using Spark 1.4 and I am getting an array out of bounds exception when I am trying to read from a registered table in Spark. For example, if I have 3 different text files with the content as below: Scenario 1: A1|B1|C1 A2|B2|C2 Scenario 2: A1| |C1 A2| |C2 Scenario 3: A1| B1| A2| B2|
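A frequent cause of this with pipe-delimited rows (offered as a guess here, not the confirmed diagnosis) is that String.split drops trailing empty fields unless a negative limit is passed:

    val line = "A1| B1|"                 // a scenario-3-style row with an empty last field
    val dropped = line.split("\\|")      // 2 elements: the trailing empty field is discarded
    val kept    = line.split("\\|", -1)  // 3 elements: the empty field is preserved
    println(s"without limit: ${dropped.length}, with -1: ${kept.length}")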

Re: Meaning of local[2]

2015-08-24 Thread Akhil Das
Just to add you can also look into SPARK_WORKER_INSTANCES configuration in the spark-env.sh file. On Aug 17, 2015 3:44 AM, Daniel Darabos daniel.dara...@lynxanalytics.com wrote: Hi Praveen, On Mon, Aug 17, 2015 at 12:34 PM, praveen S mylogi...@gmail.com wrote: What does this mean in

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-24 Thread Michael Armbrust
Follow the directions here: http://spark.apache.org/community.html On Mon, Aug 24, 2015 at 11:36 AM, Sereday, Scott scott.sere...@nielsen.com wrote: Can you please remove me from this distribution list? (Filling up my inbox too fast) *From:* Michael Armbrust

`show tables like 'tmp*';` does not work in Spark 1.3.0+

2015-08-24 Thread dugdun
Hi guys and gals, I have a Spark 1.2.0 instance running that I connect to via the thrift interface using beeline. On this instance I can send a command like `show tables like 'tmp*';` and I get a list of all tables that start with `tmp`. When testing this same command out on a server that is

Strange ClassNotFoundException in spark-shell

2015-08-24 Thread Jan Algermissen
Hi, I am using spark 1.4 M1 with the Cassandra Connector and run into a strange error when using the spark shell. This works: sc.cassandraTable("events", "bid_events").select("bid","type").take(10).foreach(println) But as soon as I put a map() in there (or filter): sc.cassandraTable("events",

Running spark shell on mesos with zookeeper on spark 1.3.1

2015-08-24 Thread kohlisimranjit
I have set up Apache Mesos using Mesosphere on CentOS 6 with Java 8. I have 3 slaves which total 3 cores and 8 GB RAM. I have set no firewalls. I am trying to run the following lines of code to test whether the setup is working: val data = 1 to 1 val distData = sc.parallelize(data)

Run Spark job from within iPython+Spark?

2015-08-24 Thread YaoPau
I set up iPython Notebook to work with the pyspark shell, and now I'd like to use %run to basically 'spark-submit' another Python Spark file and leave the objects accessible within the Notebook. I tried this, but got a 'ValueError: Cannot run multiple SparkContexts at once' error. I then tried

Re: Array Out OF Bound Exception

2015-08-24 Thread Michael Armbrust
This top line here indicates that the exception is being thrown from your code (i.e. code written in the console). at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:40) Check to make sure that you are properly handling data

Re: Performance - Python streaming v/s Scala streaming

2015-08-24 Thread Tathagata Das
The Scala version of the Kafka integration is something that we have been working on for a while, and it is likely to be more optimized than the Python one. The Python one definitely requires passing the data back and forth between the JVM and the Python VM and decoding the raw bytes to Python strings (probably less

ExternalSorter: Thread *** spilling in-memory map of 352.6 MB to disk (38 times so far)

2015-08-24 Thread d...@lumity.com
Hello, I'm trying to run a Spark 1.5 job with: ./spark-shell --driver-java-options -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1044 -Xms16g -Xmx48g -Xss128m I get lots of error messages like: 15/08/24 20:24:33 INFO ExternalSorter: Thread 172 spilling in-memory map of

Re: Spark ec2 launch problem

2015-08-24 Thread Andrew Or
Hey Garry, Have you verified that your particular VPC and subnet are open to the world? In particular, have you verified the route table attached to your VPC / subnet contains an internet gateway open to the public? I've run into this issue myself recently and that was the problem for me.

Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-24 Thread Utkarsh Sengar
Continuing this discussion: http://apache-spark-user-list.1001560.n3.nabble.com/same-log4j-slf4j-error-in-spark-9-1-td5592.html I am getting this error when I use logback-classic. SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in

Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-24 Thread Marcelo Vanzin
Hi Utkarsh, Unfortunately that's not going to be easy. Since Spark bundles all dependent classes into a single fat jar file, to remove that dependency you'd need to modify Spark's assembly jar (potentially in all your nodes). Doing that per-job is even trickier, because you'd probably need some

Re: spark and scala-2.11

2015-08-24 Thread Lanny Ripple
We're going to be upgrading from Spark 1.0.2 and using hadoop-1.2.1, so we need to build by hand. (Yes, I know. Use hadoop-2.x, but standard resource constraints apply.) I want to build against scala-2.11 and publish to our artifact repository, but finding build/spark-2.10.4 and tracing down what

Re: spark and scala-2.11

2015-08-24 Thread Jonathan Coveney
I've used the instructions and it worked fine. Can you post exactly what you're doing, and what it fails with? Or are you just trying to understand how it works? 2015-08-24 15:48 GMT-04:00 Lanny Ripple la...@spotright.com: Hello, The instructions for building spark against scala-2.11

Re: spark and scala-2.11

2015-08-24 Thread Sean Owen
The property scala-2.11 triggers the profile scala-2.11 -- and additionally disables the scala-2.10 profile, so that's the way to do it. But yes, you also need to run the script before-hand to set up the build for Scala 2.11 as well. On Mon, Aug 24, 2015 at 8:48 PM, Lanny Ripple

Re: Local Spark talking to remote HDFS?

2015-08-24 Thread Dino Fancellu
Changing the IP to the guest IP address just never connects. The VM has port tunnelling, and it passes through all the main ports, 8020 included, to the host VM. You can tell that it was talking to the guest VM before, simply because it said so when a file was not found. The error is: Exception in thread

Re: Local Spark talking to remote HDFS?

2015-08-24 Thread Roberto Congiu
When you launch your HDP guest VM, most likely it gets launched with NAT and an address on a private network (192.168.x.x), so on your Windows host you should use that address (you can find it out using ifconfig on the guest OS). I usually add an entry to my /etc/hosts for VMs that I use often. If
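A small sketch of the resulting call from the host side, once the guest address resolves; the hostname and port below are placeholders:

    // "sandbox.hortonworks.com" (mapped in /etc/hosts or the Windows hosts file) and port 8020 are assumptions.
    val words = sc.textFile("hdfs://sandbox.hortonworks.com:8020/tmp/people.txt")
    println(words.count())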

spark and scala-2.11

2015-08-24 Thread Lanny Ripple
Hello, The instructions for building spark against scala-2.11 indicate using -Dspark-2.11. When I look in the pom.xml I find a profile named 'spark-2.11' but nothing that would indicate I should set a property. The sbt build seems to need the -Dscala-2.11 property set. Finally build/mvn does a

Re: rdd count is throwing null pointer exception

2015-08-24 Thread Akhil Das
Move your count operation outside the foreach and use a broadcast to access it inside the foreach. On Aug 17, 2015 10:34 AM, Priya Ch learnings.chitt...@gmail.com wrote: Looks like because of Spark-5063 RDD transformations and actions can only be invoked by the driver, not inside of other
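A sketch of that suggestion; the RDD names are placeholders:

    val source  = sc.parallelize(1 to 100)
    val records = sc.parallelize(Seq("a", "b", "c"))

    val total   = source.count()          // the action runs on the driver, outside the foreach
    val totalBc = sc.broadcast(total)

    records.foreach { r =>
      // read the broadcast value instead of calling count() inside the task
      if (totalBc.value > 0) println(s"$r / total=${totalBc.value}")
    }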

Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-24 Thread Utkarsh Sengar
Hi Marcelo, When I add this exclusion rule to my pom: <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.10</artifactId> <version>1.4.1</version> <exclusions> <exclusion> <groupId>org.slf4j</groupId>

Re: Spark Direct Streaming With ZK Updates

2015-08-24 Thread suchenzang
When updating the ZK offset in the driver (within foreachRDD), there is somehow a serialization exception getting thrown: 15/08/24 15:45:40 ERROR JobScheduler: Error in job generator java.io.NotSerializableException: org.I0Itec.zkclient.ZkClient at

Re: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Jeff Zhang
Hi Cheng, I know that sqlContext.read will trigger one Spark job to infer the schema. What I mean is that DataFrame#show costs 2 Spark jobs, so overall it would cost 3 jobs. Here's the command I use: val df =

Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-24 Thread Utkarsh Sengar
I get the same error even when I set the SPARK_CLASSPATH: export SPARK_CLASSPATH=/.m2/repository/ch/qos/logback/logback-classic/1.1.2/logback-classic-1.1.1.jar:/.m2/repository/ch/qos/logback/logback-core/1.1.2/logback-core-1.1.2.jar And I run the job like this:

Where is Redgate's HDFS explorer?

2015-08-24 Thread Dino Fancellu
http://hortonworks.com/blog/windows-explorer-experience-hdfs/ Seemed to exist, now no sign. Anything similar to tie HDFS into Windows Explorer? Thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Where-is-Redgate-s-HDFS-explorer-tp24431.html Sent

Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-24 Thread Marcelo Vanzin
Hi Utkarsh, A quick look at slf4j's source shows it loads the first StaticLoggerBinder in your classpath. How are you adding the logback jar file to spark-submit? If you use spark.driver.extraClassPath and spark.executor.extraClassPath to add the jar, it should take precedence over the log4j

Re: DataFrame/JDBC very slow performance

2015-08-24 Thread Michael Armbrust
Much appreciated! I am not comparing with select count(*) for performance, but it was one simple thing I tried to check the performance :). I think it now makes sense since Spark tries to extract all records before doing the count. I thought having an aggregated function query submitted over

Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-24 Thread Utkarsh Sengar
That didn't work since the extraClassPath flag was still appending the jars at the end, so it's still picking the slf4j jar provided by Spark. Although I found this flag: --conf spark.executor.userClassPathFirst=true (http://spark.apache.org/docs/latest/configuration.html) and tried this: ➜ simspark

Re: Protobuf error when streaming from Kafka

2015-08-24 Thread Ted Yu
Can you show the complete stack trace ? Which Spark / Kafka release are you using ? Thanks On Mon, Aug 24, 2015 at 4:58 PM, Cassa L lcas...@gmail.com wrote: Hi, I am storing messages in Kafka using protobuf and reading them into Spark. I upgraded protobuf version from 2.4.1 to 2.5.0. I got

Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-24 Thread Marcelo Vanzin
On Mon, Aug 24, 2015 at 3:58 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: That didn't work since extraClassPath flag was still appending the jars at the end, so its still picking the slf4j jar provided by spark. Out of curiosity, how did you verify this? The extraClassPath options are

Protobuf error when streaming from Kafka

2015-08-24 Thread Cassa L
Hi, I am storing messages in Kafka using protobuf and reading them into Spark. I upgraded protobuf version from 2.4.1 to 2.5.0. I got java.lang.UnsupportedOperationException for older messages. However, even for new messages I get the same error. Spark does convert it though. I see my messages.

Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-24 Thread Utkarsh Sengar
I assumed that's the case because of the error I got and the documentation, which says: Extra classpath entries to append to the classpath of the driver. This is where I stand now: <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.10</artifactId>

What does Attribute and AttributeReference mean in Spark SQL

2015-08-24 Thread Todd
There are many case classes and concepts such as Attribute/AttributeReference/Expression in Spark SQL. I would like to ask what Attribute/AttributeReference/Expression mean. Given a SQL query like select a, b from c, are a and b two Attributes? Is a + b an expression? It looks like I misunderstand it

Re: Too many files/dirs in hdfs

2015-08-24 Thread Mohit Anchlia
Any help would be appreciated On Wed, Aug 19, 2015 at 9:38 AM, Mohit Anchlia mohitanch...@gmail.com wrote: My question was how to do this in Hadoop? Could somebody point me to some examples? On Tue, Aug 18, 2015 at 10:43 PM, UMESH CHAUDHARY umesh9...@gmail.com wrote: Of course, Java or

Re: MLlib Prefixspan implementation

2015-08-24 Thread Feynman Liang
CCing the mailing list again. It's currently not on the radar. Do you have a use case for it? I can bring it up during 1.6 roadmap planning tomorrow. On Mon, Aug 24, 2015 at 8:28 PM, alexis GILLAIN ila...@hotmail.com wrote: Hi, I just realized the article I mentioned is cited in the jira and

Re: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Shixiong Zhu
Hao, I can reproduce it using the master branch. I'm curious why you cannot reproduce it. Did you check if the input HadoopRDD did have two partitions? My test code is val df = sqlContext.read.json("examples/src/main/resources/people.json") df.show() Best Regards, Shixiong Zhu 2015-08-25 13:01

RE: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Cheng, Hao
Hi Jeff, which version are you using? I couldn't reproduce the 2 Spark jobs in the `df.show()` with the latest code; we did refactor the code for the JSON data source recently, so I'm not sure whether you're running an earlier version of it. And a known issue is that Spark SQL will try to re-list the files every time when

Re: build spark 1.4.1 with JDK 1.6

2015-08-24 Thread Eric Friedman
I'm trying to build Spark 1.4 with Java 7 and despite having that as my JAVA_HOME, I get [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ spark-launcher_2.10 --- [INFO] Using zinc server for incremental compilation [info] Compiling 8 Java sources to

Re: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Shixiong Zhu
Because defaultMinPartitions is 2 (see https://github.com/apache/spark/blob/642c43c81c835139e3f35dfd6a215d668a474203/core/src/main/scala/org/apache/spark/SparkContext.scala#L2057 ), your input people.json will be split into 2 partitions. At first, `take` will start a job for the first partition.

Re: Drop table and Hive warehouse

2015-08-24 Thread Kevin Jung
Thanks, Michael. I discovered it myself. Finally, it was not a bug in Spark. I have two HDFS clusters, and Hive uses hive.metastore.warehouse.dir + fs.defaultFS (HDFS1) for saving internal tables and also references a default database URI (HDFS2) in the DBS table from the metastore. It may not be a

Re: Spark Direct Streaming With ZK Updates

2015-08-24 Thread Cody Koeninger
I'd start off by trying to simplify that closure - you don't need the transform step, or currOffsetRanges to be scoped outside of it. Just do everything in foreachRDD. Likewise, it looks like zkClient is also scoped outside of the closure passed to foreachRDD, i.e. you have zkClient = new
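A hedged sketch of that advice - the broker, topic and ZooKeeper addresses are placeholders, and the only point is that the ZkClient is created inside foreachRDD (which runs on the driver) rather than captured from an outer scope:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    val ssc = new StreamingContext(new SparkConf().setAppName("zk-offsets"), Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val zkClient = new org.I0Itec.zkclient.ZkClient("zk-host:2181") // created here, never serialized
      offsetRanges.foreach { or =>
        // write or.topic / or.partition / or.untilOffset to ZooKeeper for monitoring
      }
      zkClient.close()
    }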

RE: Test case for the spark sql catalyst

2015-08-24 Thread Cheng, Hao
Yes, check the source code under: https://github.com/apache/spark/tree/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst From: Todd [mailto:bit1...@163.com] Sent: Tuesday, August 25, 2015 1:01 PM To: user@spark.apache.org Subject: Test case for the spark sql catalyst Hi, Are

Re: Spark

2015-08-24 Thread Sonal Goyal
I think you could try sorting the endPointsCount and then doing a take. This should be a distributed process and only the result would get returned to the driver. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co Check out Reifier at Spark Summit 2015
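A small sketch of the idea with a made-up endPointsCount; the sort stays distributed and only the top rows come back to the driver:

    val endPointsCount = sc.parallelize(Seq(("/home", 120L), ("/login", 48L), ("/api", 305L)))

    val top10 = endPointsCount.sortBy(_._2, ascending = false).take(10)
    // rdd.top(10)(Ordering.by(_._2)) is an alternative that avoids a full sort.
    top10.foreach(println)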

Re: How to list all dataframes and RDDs available in current session?

2015-08-24 Thread Dhaval Gmail
Okay, but how? That's what I am trying to figure out. Any command you would suggest? Sent from my iPhone, please excuse any typos :) On Aug 21, 2015, at 11:45 PM, Raghavendra Pandey raghavendra.pan...@gmail.com wrote: You get the list of all the persisted RDDs using the Spark context... On
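The call being alluded to is presumably sc.getPersistentRDDs; a short sketch (note that cached DataFrames only show up through their underlying RDDs):

    sc.getPersistentRDDs.foreach { case (id, rdd) =>
      println(s"id=$id name=${rdd.name} storage=${rdd.getStorageLevel.description}")
    }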

Spark

2015-08-24 Thread Spark Enthusiast
I was running a Spark job to crunch a 9 GB Apache log file when I saw the following error: 15/08/25 04:25:16 WARN scheduler.TaskSetManager: Lost task 99.0 in stage 37.0 (TID 4115, ip-10-150-137-100.ap-southeast-1.compute.internal): ExecutorLostFailure (executor 29 lost) 15/08/25 04:25:16 INFO

Test case for the spark sql catalyst

2015-08-24 Thread Todd
Hi, Are there test cases for the Spark SQL Catalyst, such as tests of the rules for transforming an unresolved query plan? Thanks!

How to access Spark UI through AWS

2015-08-24 Thread Justin Pihony
I am using the steps from this article https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923 to get Spark up and running on EMR through YARN. Once up and running, I ssh in and cd to the Spark bin and run spark-shell --master yarn. Once this spins up I can see that the UI is started