BinaryFiles to ZipInputStream

2016-03-23 Thread Benjamin Kim
I need a little help. I am loading zipped CSV files stored in S3 into Spark 1.6. First of all, I am able to get the list of file keys that have a modified date within a range of time by using the AWS SDK objects (AmazonS3Client, ObjectListing, S3ObjectSummary, ListObjectsRequest,
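A minimal sketch of the binaryFiles-to-ZipInputStream approach the subject describes; the bucket path, UTF-8 encoding, and text-entries-per-zip assumption are illustrative, not taken from the thread:

    import java.util.zip.ZipInputStream
    import scala.io.Source
    import org.apache.spark.input.PortableDataStream

    // read each zip as a (path, binary stream) pair, then unzip the entries on the executors
    val zipped = sc.binaryFiles("s3n://my-bucket/landing/*.zip")   // path is an assumption
    val lines = zipped.flatMap { case (_, content: PortableDataStream) =>
      val zis = new ZipInputStream(content.open())
      // ZipInputStream signals end-of-entry with -1, so Source reads one entry's text at a time
      Stream.continually(zis.getNextEntry).takeWhile(_ != null).flatMap { _ =>
        Source.fromInputStream(zis, "UTF-8").getLines()
      }
    }
    lines.take(5).foreach(println)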

Re: spark 1.6.0 connect to hive metastore

2016-03-23 Thread Koert Kuipers
can someone provide the correct settings for spark 1.6.1 to work with cdh 5 (hive 1.1.0)? in particular the settings for: spark.sql.hive.version spark.sql.hive.metastore.jars also it would be helpful to know if your spark jar includes hadoop dependencies or not. i realize it works (or at least
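A hedged sketch of the settings typically involved (the version-related key is spark.sql.hive.metastore.version rather than spark.sql.hive.version); the jar paths below are assumptions based on a typical CDH parcel layout, not confirmed values from this thread:

    // set before the HiveContext is created, e.g. via SparkConf or spark-defaults.conf
    val conf = new org.apache.spark.SparkConf()
      .set("spark.sql.hive.metastore.version", "1.1.0")
      .set("spark.sql.hive.metastore.jars",
           "/opt/cloudera/parcels/CDH/lib/hive/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop/client/*")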

Re: What's the benefit of RDD checkpoint against RDD save

2016-03-23 Thread Mark Hamstra
Yes, the terminology is being used sloppily/non-standardly in this thread -- "the last RDD" after a series of transformation is the RDD at the beginning of the chain, just now with an attached chain of "to be done" transformations when an action is eventually run. If the saveXXX action is the

Re: What's the benefit of RDD checkpoint against RDD save

2016-03-23 Thread Ted Yu
bq. when I get the last RDD If I read Todd's first email correctly, the computation has been done. I could be wrong. On Wed, Mar 23, 2016 at 7:34 PM, Mark Hamstra wrote: > Neither of you is making any sense to me. If you just have an RDD for > which you have specified

Re: What's the benefit of RDD checkpoint against RDD save

2016-03-23 Thread Mark Hamstra
Neither of you is making any sense to me. If you just have an RDD for which you have specified a series of transformations but you haven't run any actions, then neither checkpointing nor saving makes sense -- you haven't computed anything yet, you've only written out the recipe for how the

Re: What's the benefit of RDD checkpoint against RDD save

2016-03-23 Thread Ted Yu
See the doc for checkpoint: * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint * directory set with `SparkContext#setCheckpointDir` and all references to its parent * RDDs will be removed. *This function must be called before any job has been* * *

What's the benefit of RDD checkpoint against RDD save

2016-03-23 Thread Todd
Hi, I have a long computing chain, when I get the last RDD after a series of transformation. I have two choices to do with this last RDD 1. Call checkpoint on RDD to materialize it to disk 2. Call RDD.saveXXX to save it to HDFS, and read it back for further processing I would ask which choice
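A minimal sketch contrasting the two choices the question describes; paths, the placeholder transformation chain, and the record type are assumptions:

    // "lastRdd" stands for the RDD at the end of the transformation chain (placeholder chain)
    val lastRdd = sc.textFile("hdfs:///data/input").map(_.toUpperCase)

    // choice 1: checkpoint (must be marked before the first action on the RDD)
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")     // assumed directory
    lastRdd.cache()          // avoids computing the chain twice when the checkpoint is written
    lastRdd.checkpoint()
    lastRdd.count()          // the action materializes the data and writes the checkpoint

    // choice 2: save to HDFS explicitly and read it back for further processing
    lastRdd.saveAsObjectFile("hdfs:///tmp/last_rdd")   // assumed path
    val reloaded = sc.objectFile[String]("hdfs:///tmp/last_rdd")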

Re: calling individual columns from spark temporary table

2016-03-23 Thread Ashok Kumar
Thank you again For val r = df.filter(col("paid") > "").map(x => (x.getString(0),x.getString(1).) Can you give an example of column expression please like df.filter(col("paid") > "").col("firstcolumn").getString   ? On Thursday, 24 March 2016, 0:45, Michael Armbrust

Re: calling individual columns from spark temporary table

2016-03-23 Thread Michael Armbrust
You can only use as on a Column expression, not inside of a lambda function. The reason is the lambda function is compiled into opaque bytecode that Spark SQL is not able to see. We just blindly execute it. However, there are a couple of ways to name the columns that come out of a map. Either
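A hedged sketch of the two naming routes described above, assuming `df` is the DataFrame from this thread and the column names are illustrative; `as` stays on the Column side, while a lambda needs names supplied when converting back to a DataFrame:

    import org.apache.spark.sql.functions.col
    import sqlContext.implicits._

    // 1) keep everything as Column expressions, so `as` works
    val named = df.filter(col("paid") > "")
      .select(col("firstcolumn").as("first"), col("secondcolumn").as("second"))

    // 2) go through a lambda, then name the columns with toDF on the way back
    val mapped = df.filter(col("paid") > "")
      .map(x => (x.getString(0), x.getString(1)))
      .toDF("first", "second")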

RE: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-23 Thread Yong Zhang
Sounds good. I will manually merge this patch on 1.6.1, and test again for my case tomorrow on my environment and will update later. Thanks Yong > Date: Wed, 23 Mar 2016 16:20:23 -0700 > Subject: Re: Spark 1.5.2, why the broadcast join shuffle so much data in the > last step > From:

Re: calling individual columns from spark temporary table

2016-03-23 Thread Ashok Kumar
thank you sir sql("select `_1` as firstcolumn from items") is there anyway one can keep the csv column names using databricks when mapping val r = df.filter(col("paid") > "").map(x => (x.getString(0),x.getString(1).) can I call example  x.getString(0).as.(firstcolumn) in above when mapping

Re: calling individual columns from spark temporary table

2016-03-23 Thread Michael Armbrust
You probably need to use `backticks` to escape `_1` since I don't think that its a valid SQL identifier. On Wed, Mar 23, 2016 at 5:10 PM, Ashok Kumar wrote: > Gurus, > > If I register a temporary table as below > > r.toDF > res58: org.apache.spark.sql.DataFrame =

calling individual columns from spark temporary table

2016-03-23 Thread Ashok Kumar
Gurus, If I register a temporary table as below: r.toDF res58: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: double, _4: double, _5: double] r.toDF.registerTempTable("items") sql("select * from items") res60: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: double, _4:
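A small sketch of the two fixes suggested in the replies: escape the auto-generated _N names with backticks, or give the columns real names up front with toDF; the column names below are assumptions:

    // escape the generated names in SQL
    sqlContext.sql("select `_1` as firstcolumn, `_2` as secondcolumn from items")

    // or name the columns when building the DataFrame, so no escaping is needed
    val items = r.toDF("firstcolumn", "secondcolumn", "col3", "col4", "col5")
    items.registerTempTable("items")
    sqlContext.sql("select firstcolumn from items")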

Re: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-23 Thread Ted Yu
SPARK-13383 is fixed in 2.0 only, as of this moment. Any chance of backporting to branch-1.6 ? Thanks On Wed, Mar 23, 2016 at 4:20 PM, Davies Liu wrote: > On Wed, Mar 23, 2016 at 10:35 AM, Yong Zhang wrote: > > Here is the output: > > > > ==

Best way to determine # of workers

2016-03-23 Thread Ajaxx
I'm building some elasticity into my model and I'd like to know when my workers have come online. It appears at present that the API only supports getting information about applications. Is there a good way to determine how many workers are available? -- View this message in context:
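Two hedged ways to approximate this; both rest on assumptions about what counts as an "available worker" for the application:

    // executors currently registered with this application
    // (the map includes the driver's block manager, hence the -1)
    val executorCount = sc.getExecutorMemoryStatus.size - 1

    // on a standalone cluster, the master also exposes worker status as JSON,
    // e.g. http://<master-host>:8080/json (standalone mode; endpoint is an assumption here)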

Forcing data from disk to memory

2016-03-23 Thread Daniel Imberman
Hi all, So I have a question about persistence. Let's say I have an RDD that's persisted MEMORY_AND_DISK, and I know that I now have enough memory space cleared up that I can force the data on disk into memory. Is it possible to tell spark to re-evaluate the open RDD memory and move that
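Spark does not promote already-spilled blocks back to memory on its own; a hedged workaround is to drop the cached copy and re-persist at a memory-only level (the input path is a placeholder):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("hdfs:///data/big")          // placeholder input
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count()                                        // may spill some partitions to disk

    // later, once enough memory has been freed up:
    rdd.unpersist(blocking = true)                     // drop the MEMORY_AND_DISK copy
    rdd.persist(StorageLevel.MEMORY_ONLY)
    rdd.count()                                        // an action re-materializes the cache in memory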

Re: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-23 Thread Davies Liu
On Wed, Mar 23, 2016 at 10:35 AM, Yong Zhang wrote: > Here is the output: > > == Parsed Logical Plan == > Project [400+ columns] > +- Project [400+ columns] >+- Project [400+ columns] > +- Project [400+ columns] > +- Join Inner, Somevisid_high#460L =

Re: Spark with Druid

2016-03-23 Thread Michael Malak
Will Spark 2.0 Structured Streaming obviate some of the Druid/Spark use cases? From: Raymond Honderdors To: "yuzhih...@gmail.com" Cc: "user@spark.apache.org" Sent: Wednesday, March 23, 2016 8:43 AM Subject:

Re: Converting array of string type to datetime

2016-03-23 Thread Mich Talebzadeh
Hi Jacek, I was wondering if I could use this approach itself. It is basically a CSV read in as follows: val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc) val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header",

Re: Converting array of string type to datetime

2016-03-23 Thread Jacek Laskowski
Hi, Why don't you use Datasets? You'd cut the number of getStrings and it'd read nicer to your eyes. Also, doing such transformations would *likely* be easier. p.s. Please gist your example to fix it. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark

Converting array of string type to datetime

2016-03-23 Thread Mich Talebzadeh
How can I convert the following from String to datetime scala> df.map(x => (x.getString(1), ChangeDate(x.getString(1.take(1) res60: Array[(String, String)] = Array((10/02/2014,2014-02-10)) Please note that the custom UDF ChangeDate() has reversed the string value from "dd/MM/" to
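A hedged alternative to the custom UDF, using the built-in date functions; the column name "dateCol" and the dd/MM/yyyy input format are assumptions:

    import org.apache.spark.sql.functions.{col, from_unixtime, unix_timestamp}

    // df is the DataFrame from the thread; parse the string and re-emit it as yyyy-MM-dd
    val converted = df.withColumn(
      "date_parsed",
      from_unixtime(unix_timestamp(col("dateCol"), "dd/MM/yyyy"), "yyyy-MM-dd"))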

Is streaming bisecting k-means possible?

2016-03-23 Thread dustind
Is it technically possible to use bisecting k-means on streaming data and allow for model decay+updating like streaming k-means? If so, could you provide some guidance on how to implement this? If not, what is the best approach to using bisecting k-means on streaming data? -- View this

Spark thrift issue 8659 (changing subject)

2016-03-23 Thread ayan guha
Hi All I found this issue listed in Spark Jira - https://issues.apache.org/jira/browse/SPARK-8659 I would love to know if there are any roadmap for this? Maybe someone from dev group can confirm? Thank you in advance Best Ayan

Re: Spark Streaming UI duration numbers mismatch

2016-03-23 Thread Jean-Baptiste Onofré
Hi Jatin, I will reproduce tomorrow and take a look. Did you already create a Jira about that (I don't think so) ? If I reproduce the problem (and it's really a problem), then I will create one for you. Thanks, Regards JB On 03/23/2016 08:20 PM, Jatin Kumar wrote: Hello, Can someone

Re: Spark Streaming UI duration numbers mismatch

2016-03-23 Thread Jatin Kumar
Hello, Can someone please provide some help on the below issue? -- Thanks Jatin On Tue, Mar 22, 2016 at 3:30 PM, Jatin Kumar wrote: > Hello all, > > I am running spark streaming application and the duration numbers on batch > page and job page don't match. Please

Re: ClassNotFoundException in RDD.map

2016-03-23 Thread Dirceu Semighini Filho
Thanks Jakob, I've looked into the source code and found that I was missing this property: spark.repl.class.uri. Setting it solved the problem. Cheers 2016-03-17 18:14 GMT-03:00 Jakob Odersky : > The error is very strange indeed, however without code that reproduces > it,
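A hedged sketch of where that property goes when a SparkContext is built from inside a REPL-like environment; the URI has to come from the interpreter's class server, and the variable name below is hypothetical:

    val conf = new org.apache.spark.SparkConf()
      .setAppName("interactive-app")
      .set("spark.repl.class.uri", interpreterClassServerUri)  // hypothetical value supplied by your REPL
    val sc = new org.apache.spark.SparkContext(conf)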

RE: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-23 Thread Yong Zhang
Here is the output: == Parsed Logical Plan == Project [400+ columns] +- Project [400+ columns] +- Project [400+ columns] +- Project [400+ columns] +- Join Inner, Somevisid_high#460L = visid_high#948L) && (visid_low#461L = visid_low#949L)) && (date_time#25L > date_time#513L)))

Re: Serialization issue with Spark

2016-03-23 Thread Dirceu Semighini Filho
Hello Hafsa, TaskNotSerialized exception usually means that you are trying to use an object, defined in the driver, in code that runs on workers. Can you post the code that is generating this error here, so we can better advise you? Cheers. 2016-03-23 14:14 GMT-03:00 Hafsa Asif
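A hedged sketch of the usual cause and fix: referencing a field of a driver-side, non-serializable object inside a closure drags the whole object into the task, while copying the value to a local val keeps the closure clean (class and field names are illustrative):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    class Job(sc: SparkContext) {            // not Serializable
      val threshold = 10

      def fails(data: RDD[Int]): RDD[Int] =
        data.filter(_ > threshold)           // closure captures `this`, which cannot be serialized

      def works(data: RDD[Int]): RDD[Int] = {
        val t = threshold                    // copy the value to a local val first
        data.filter(_ > t)                   // closure now only captures an Int
      }
    }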

Re: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-23 Thread Davies Liu
The broadcast hint does not work as expected in this case, could you also show the logical plan from 'explain(true)'? On Wed, Mar 23, 2016 at 8:39 AM, Yong Zhang wrote: > > So I am testing this code to understand "broadcast" feature of DF on Spark > 1.6.1. > This time I am

Re: Serialization issue with Spark

2016-03-23 Thread Hafsa Asif
Can anyone please help me in this issue? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Serialization-issue-with-Spark-tp26565p26579.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

running fpgrowth -- run method

2016-03-23 Thread Bauer, Robert
I get warnings: SparkContext: Requesting executors is only supported in coarse-grained mode ExecutorAllocationManager: Unable to reach the cluster manager to request 2 total executors I get info messages: INFO ContextCleaner: Cleaned accumulator 4 Then my "job" just seems to hang - I don't

Re: Spark and DB connection pool

2016-03-23 Thread Takeshi Yamamuro
Hi, Currently, Spark itself doesn't pool JDBC connections. If you face performance difficulty, all you can do is to cache loaded data from RDB and Cassandra in Spark. thanks, maropu On Wed, Mar 23, 2016 at 11:56 PM, rjtokenring wrote: > Hi all, is there a way in
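Since Spark does not pool connections itself, a common hedged pattern is to open one connection per partition rather than per record; the JDBC URL, credentials, table, and placeholder data below are assumptions:

    import java.sql.DriverManager

    val ids = sc.parallelize(1 to 100)                 // placeholder data
    ids.foreachPartition { rows =>
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://db-host:5432/mydb", "user", "secret")   // assumed connection details
      try {
        val stmt = conn.prepareStatement("insert into events(id) values (?)")
        rows.foreach { id => stmt.setInt(1, id); stmt.executeUpdate() }
      } finally {
        conn.close()
      }
    }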

Streaming bisecting k-means possible?

2016-03-23 Thread Dustin Decker
Is it technically possible to use bisecting k-means on streaming data and allow for model decay/updating like streaming k-means? If not, what is the best approach to using bisecting k-means on streaming data?

Spark and DB connection pool

2016-03-23 Thread rjtokenring
Hi all, is there a way in spark to setup a connection pool? As example: I'm going to use a relational DB and Cassandra to join data between them. How can I control and cache DB connections? Thanks all! Mark -- View this message in context:

Re: Spark with Druid

2016-03-23 Thread Raymond Honderdors
I saw these but I fail to understand how to direct the code to use the index JSON Sent from Outlook Mobile On Wed, Mar 23, 2016 at 7:19 AM -0700, "Ted Yu" > wrote: Please see:

Re: Spark with Druid

2016-03-23 Thread Ted Yu
Please see: https://www.linkedin.com/pulse/combining-druid-spark-interactive-flexible-analytics-scale-butani which references https://github.com/SparklineData/spark-druid-olap On Wed, Mar 23, 2016 at 5:59 AM, Raymond Honderdors < raymond.honderd...@sizmek.com> wrote: > Does anyone have a good

Re: Spark Metrics Framework?

2016-03-23 Thread Mike Sukmanowsky
Thanks Ted and Silvio. I think I'll need a bit more hand holding here, sorry. The way we use ES Hadoop is in pyspark via org.elasticsearch.hadoop.mr.EsOutputFormat

Spark with Druid

2016-03-23 Thread Raymond Honderdors
Does anyone have a good overview on how to integrate Spark and Druid? I am now struggling with the creation of a druid data source in spark. Raymond Honderdors Team Lead Analytics BI Business Intelligence Developer raymond.honderd...@sizmek.com T

RE: Steps to Run Spark Scala job from Oozie on EC2 Hadoop clsuter

2016-03-23 Thread Joshua Dickerson
Hi everyone, We have been successfully deploying Spark jobs through Oozie using the spark-action for over a year, however, we are deploying to our on-premises Hadoop infrastructure, not EC2.  Our process is to build a fat-jar with our job and dependencies, upload that jar to HDFS

DataFrames UDAF with array and struct

2016-03-23 Thread Matthias Niehoff
Hello Everybody, I want to write a UDAF for DataFrames where the buffer schema is an ArrayType(StructType(List(StructField(String), StructField(String),StructField(String When I want to access the buffer in the update() or merge() method the ArrayType gets returned as a List. But what is
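A minimal hedged sketch of such a UDAF; the field names and the collect-everything aggregation are illustrative. The point relevant to the thread is that the ArrayType buffer column comes back as a Seq[Row] (one Row per struct) inside update() and merge():

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    class CollectTriples extends UserDefinedAggregateFunction {
      private val entry = StructType(Seq(
        StructField("a", StringType), StructField("b", StringType), StructField("c", StringType)))

      def inputSchema: StructType = entry
      def bufferSchema: StructType = StructType(Seq(StructField("entries", ArrayType(entry))))
      def dataType: DataType = ArrayType(entry)
      def deterministic: Boolean = true

      def initialize(buffer: MutableAggregationBuffer): Unit = buffer.update(0, Seq.empty[Row])

      def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
        // the ArrayType column is handed back as a Seq[Row], one Row per struct
        val current = buffer.getAs[Seq[Row]](0)
        buffer.update(0, current :+ Row(input.getString(0), input.getString(1), input.getString(2)))
      }

      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
        buffer1.update(0, buffer1.getAs[Seq[Row]](0) ++ buffer2.getAs[Seq[Row]](0))

      def evaluate(buffer: Row): Any = buffer.getAs[Seq[Row]](0)
    }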

Re: Plot DataFrame with matplotlib

2016-03-23 Thread Teng Qiu
e... then this sounds like a feature requirement for matplotlib, you need to make matplotlib's APIs support RDD or spark DataFrame object, i checked the API of mplot3d (http://matplotlib.org/mpl_toolkits/mplot3d/tutorial.html#mpl_toolkits.mplot3d.Axes3D.scatter), it only supports "array-like"

[Spark -1.5.2]Dynamically creation of caseWhen expression

2016-03-23 Thread Divya Gehlot
Hi, I have a map collection. I am trying to build a when condition based on the key values, like df.withColumn("ID", when( condition with map keys, values of map ) How can I do that dynamically? Currently I am iterating over keysIterator and get the values: val keys = myMap.keysIterator.toArray
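A hedged sketch of folding the map into a single chained when(...) expression, assuming `df` is the DataFrame from the thread; the column name "key" and the sample mapping are assumptions:

    import org.apache.spark.sql.functions.{col, lit, when}

    val mapping = Map("A" -> "Apple", "B" -> "Banana")   // illustrative map
    // start from a null default and wrap one when/otherwise layer per map entry
    val idExpr = mapping.foldLeft(lit(null).cast("string")) {
      case (acc, (k, v)) => when(col("key") === k, v).otherwise(acc)
    }
    val withId = df.withColumn("ID", idExpr)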

Using Spark to retrieve a HDFS file protected by Kerberos

2016-03-23 Thread Nkechi Achara
I am having issues setting up my spark environment to read from a kerberized HDFS file location. At the moment I have tried to do the following: def ugiDoAs[T](ugi: Option[UserGroupInformation])(code: => T) = ugi match { case None => code case Some(u) => u.doAs(new
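A hedged sketch completing the pattern above: obtain a UGI from a keytab and run the HDFS read inside the ugiDoAs helper shown in the question; the principal, keytab path, and HDFS URI are assumptions:

    import org.apache.hadoop.security.UserGroupInformation

    val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
      "etl-user@EXAMPLE.COM", "/etc/security/keytabs/etl-user.keytab")   // assumed principal/keytab

    // ugiDoAs is the helper defined in the question
    val count = ugiDoAs(Some(ugi)) {
      sc.textFile("hdfs://secure-nn:8020/data/input").count()            // assumed URI
    }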

Re: Plot DataFrame with matplotlib

2016-03-23 Thread Yavuz Nuzumlalı
Thanks for the help, but the example you referenced gets the values from the RDD as a list and plots that list. What I am specifically asking is: is there a convenient way to plot a DataFrame object directly (like pandas DataFrame objects)? On Wed, Mar 23, 2016 at 11:47 AM Teng Qiu

Re: Zeppelin Integration

2016-03-23 Thread ayan guha
Hi All After spending a few more days with the issue, I finally found it listed in Spark Jira - https://issues.apache.org/jira/browse/SPARK-8659 I would love to know if there is any roadmap for this? Maybe someone from dev group can confirm? Thank you in advance Best Ayan On Thu, Mar

Re: Spark SQL Optimization

2016-03-23 Thread Takeshi Yamamuro
Hi, all What's the size of three tables? Also, what's the performance difference of the two queries? On Tue, Mar 22, 2016 at 3:53 PM, Rishi Mishra wrote: > What we have observed so far is Spark picks join order in the same order > as tables in from clause is specified.

Applying filter to a date column

2016-03-23 Thread Mich Talebzadeh
Hi, I have a UDF created and registered as below val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2") val current_date = sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/')

Cached Parquet file paths problem

2016-03-23 Thread psmolinski
Hi, After migration from Spark 1.5.2 to 1.6.1 I faced strange issue. I have a Parquet directory with partitions. Each partition (month) is a subject of incremental ETL that takes current Avro files and replaces the corresponding Parquet files. Now there is a problem that appeared in 1.6.x: I

Re: Plot DataFrame with matplotlib

2016-03-23 Thread Teng Qiu
not sure about 3d plot, but there is a nice example: https://github.com/zalando/spark-appliance/blob/master/examples/notebooks/PySpark_sklearn_matplotlib.ipynb for plotting rdd or dataframe using matplotlib. On Wednesday, 23 March 2016, Yavuz Nuzumlalı wrote: > Hi all, > I'm trying to plot the

Plot DataFrame with matplotlib

2016-03-23 Thread Yavuz Nuzumlalı
Hi all, I'm trying to plot the result of a simple PCA operation, but couldn't find a clear documentation about plotting data frames. Here is the output of my data frame: ++ |pca_features

Re: Problem using saveAsNewAPIHadoopFile API

2016-03-23 Thread Surendra , Manchikanti
Hi Vetal, You may try with MultiOutPutFormat instead of TextOutPutFormat in saveAsNewAPIHadoopFile(). Regards, Surendra M -- Surendra Manchikanti On Tue, Mar 22, 2016 at 10:26 AM, vetal king wrote: > We are using Spark 1.4 for Spark Streaming. Kafka is data source for