Extract timestamp from Kafka message

2016-09-25 Thread Kevin Tran
Hi everyone, does anyone know how we could extract the timestamp from a Kafka message in Spark Streaming? JavaPairInputDStream messagesDStream = KafkaUtils.createDirectStream( ssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
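
A note on the API above: the Kafka 0.8 direct stream shown (StringDecoder-based) does not expose message timestamps. A minimal sketch, assuming the spark-streaming-kafka-0-10 integration instead, where each record is a ConsumerRecord carrying its timestamp (the topic name and broker address are illustrative):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",   // illustrative broker address
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // ConsumerRecord.timestamp() is available with Kafka 0.10+ brokers/messages.
    val timestamped = stream.map(r => (r.timestamp(), r.value()))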

Spark 2.0 Structured Streaming: sc.parallelize in foreach sink cause Task not serializable error

2016-09-25 Thread Jianshi
Dear all, I am trying out the newly released Structured Streaming feature in Spark 2.0. I use Structured Streaming to perform windowing by event time. I can print the result to the console, but I would like to write the result to a Cassandra database through the foreach sink option. I am
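
For reference, the usual cause of this error is touching the SparkContext (e.g. sc.parallelize) inside the foreach sink, which runs on executors. A minimal sketch of a ForeachWriter that opens a plain connection per partition instead; CassandraClient here is a hypothetical stand-in for a real driver, and windowedCounts for the streaming result:

    import org.apache.spark.sql.{ForeachWriter, Row}

    class CassandraSink extends ForeachWriter[Row] {
      var client: CassandraClient = _                        // hypothetical driver wrapper
      override def open(partitionId: Long, version: Long): Boolean = {
        client = CassandraClient.connect("cassandra-host")   // per-partition connection
        true
      }
      override def process(row: Row): Unit = client.insert("ks.table", row)
      override def close(errorOrNull: Throwable): Unit = client.close()
    }

    windowedCounts.writeStream.foreach(new CassandraSink).start()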

udf forces usage of Row for complex types?

2016-09-25 Thread Koert Kuipers
After having gotten used to case classes representing complex structures in Datasets, I am surprised to find that when I work in DataFrames with UDFs no such magic exists, and I have to fall back to manipulating Row objects, which is error prone and somewhat ugly. For example: case class
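
A minimal sketch of the behavior being described (class and column names are illustrative): a struct column reaches a DataFrame UDF as a Row, not as the case class it was built from:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.{col, udf}
    import spark.implicits._   // assuming a SparkSession named spark

    case class Point(x: Double, y: Double)
    val df = Seq((1, Point(1.0, 2.0))).toDF("id", "p")

    // The UDF must take a Row and extract fields by position (or name);
    // declaring (p: Point) => ... compiles but fails at runtime with a cast error.
    val norm = udf { (p: Row) => math.hypot(p.getDouble(0), p.getDouble(1)) }
    df.withColumn("norm", norm(col("p"))).show()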

Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-25 Thread Peter Figliozzi
Both df.write.csv("/path/to/foo") and df.write.format("com.databricks.spark.csv").save("/path/to/foo") result in a *blank* file called "_SUCCESS" under /path/to/foo. My df has stuff in it; I tried this with both my real df and a quick df constructed from literals. Why isn't it writing

Re: ArrayType support in Spark SQL

2016-09-25 Thread Koert Kuipers
Not pretty, but this works: import org.apache.spark.sql.functions.udf df.withColumn("array", udf(() => Seq(1, 2, 3)).apply()) On Sun, Sep 25, 2016 at 6:13 PM, Jason White wrote: > It seems that `functions.lit` doesn't support ArrayTypes. To reproduce: > >
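
An alternative sketch that avoids the UDF entirely: build the literal array from element literals with functions.array (available since Spark 1.4):

    import org.apache.spark.sql.functions.{array, lit}

    df.withColumn("array", array(lit(1), lit(2), lit(3)))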

Re: Fw: Spark + Parquet + IBM Block Storage at Bluemix

2016-09-25 Thread Mario Ds Briggs
Hi Daniel, can you give it a try in IBM's Analytics for Spark? The fix has been in for a week now. Thanks, Mario From: Daniel Lopes To: Mario Ds Briggs/India/IBM@IBMIN Cc: Adam Roberts, user

Re: Spark MLlib ALS algorithm

2016-09-25 Thread Roshani Nagmote
Hello, I ran the ALS algorithm on 30 c4.8xlarge machines (60GB RAM each) with the Netflix dataset (1.4GB; Users: 480189, Items: 17770, Ratings: 99M). *Command* I run: /usr/lib/spark/bin/spark-submit --deploy-mode cluster --master yarn --jars /usr/lib/spark/examples/jars/scopt_2.11-3.3.0.jar
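
For context, a minimal sketch of an equivalent spark.ml ALS run; the parameters and column names here are illustrative, not those of the reported job:

    import org.apache.spark.ml.recommendation.ALS

    val als = new ALS()
      .setRank(10)
      .setMaxIter(10)
      .setRegParam(0.01)
      .setUserCol("userId")
      .setItemCol("movieId")
      .setRatingCol("rating")

    val model = als.fit(ratings)   // ratings: DataFrame(userId, movieId, rating)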

MLlib Documentation Update Needed

2016-09-25 Thread Tobi Bosede
The loss function for logistic regression here is confusing. It seems to imply that Spark uses only -1 and 1 class labels. However, it uses 0 and 1, as the very inconspicuous note quoted below (under Classification) states.
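
For reference, the convention the docs use versus what the API accepts (this is the discrepancy the post points at): the documented logistic loss is written for y in {-1, +1}, while the training API takes labels in {0, 1} and maps label 0 to y = -1 internally:

    L(\mathbf{w}; \mathbf{x}, y) = \log\bigl(1 + \exp(-y\,\mathbf{w}^{\top}\mathbf{x})\bigr),
    \qquad y \in \{-1, +1\}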

Re: Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-25 Thread Piotr Smoliński
Hi Peter, the blank _SUCCESS file indicates a properly finished output operation. What is the topology of your application? I presume you write to the local filesystem and have more than one worker machine. In that case Spark will write the result files for each partition (in the worker which holds
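
A minimal sketch of the two usual remedies, assuming a path visible to all workers and Spark 2.0's session API:

    // Write a single part file by collapsing to one partition first
    df.coalesce(1).write.csv("/path/to/foo")   // one part-* file, plus _SUCCESS

    // Or read the whole output directory back; Spark treats it as one dataset
    val back = spark.read.csv("/path/to/foo")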

Re: spark stream based deduplication

2016-09-25 Thread backtrack5
Thank you @markcitizen. What I want to achieve is, say for example: my historic RDD has (Hash1, recordid1), (Hash2, recordid2), and in the new stream I have (Hash3, recordid3), (Hash1, recordid5). In this scenario, 1) for recordid5, I should get recordid5 is duplicate of
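
A minimal sketch of the lookup described above (types and names are illustrative): join each new batch against the historic (hash -> recordId) RDD to flag duplicates, then fold the batch into the historic set for later intervals:

    import org.apache.spark.rdd.RDD

    // Returns (hash, newRecordId, originalRecordId) for every duplicate in the batch
    def findDuplicates(historic: RDD[(String, String)],
                       batch: RDD[(String, String)]): RDD[(String, String, String)] =
      batch.join(historic)                          // keeps only hashes seen before
           .map { case (hash, (newId, oldId)) => (hash, newId, oldId) }

    // Grow the historic set so later batches also see this batch's hashes
    def updateHistoric(historic: RDD[(String, String)],
                       batch: RDD[(String, String)]): RDD[(String, String)] =
      historic.union(batch).reduceByKey((keep, _) => keep)   // one id per hash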

Left Join Yields Results And Not Results

2016-09-25 Thread Aaron Jackson
Hi, I'm using PySpark (1.6.2) to do a little bit of ETL and have noticed a very odd situation. I have two dataframes, base and updated. The "updated" dataframe contains a constrained subset of data from "base" that I wish to exclude. Something like this: updated = base.where(base.X =
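
A minimal sketch (in Scala; the post uses PySpark 1.6) of the exclusion pattern that avoids column ambiguity: left-join base against just the keys of updated and keep the rows with no match. The "id" key column is illustrative:

    import org.apache.spark.sql.functions.col

    val updKeys = updated.select(col("id").as("upd_id"))
    val remaining = base
      .join(updKeys, base("id") === updKeys("upd_id"), "left_outer")
      .where(updKeys("upd_id").isNull)   // rows of base absent from updated
      .drop("upd_id")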

Re: How to use Spark-Scala to download a CSV file from the web?

2016-09-25 Thread Marco Mistroni
Hi, not sure if spark-csv supports the http:// URL format you are using to load data from the web. I just tried this and got an exception: scala> val df = sqlContext.read. | format("com.databricks.spark.csv"). | option("inferSchema", "true"). |

Re: How to use Spark-Scala to download a CSV file from the web?

2016-09-25 Thread Jörn Franke
Use a tool like Flume and/or Oozie to reliably download files from HTTP and do error handling (e.g. request timeouts). Afterwards process the data with Spark. > On 25 Sep 2016, at 10:27, Dan Bikle wrote: > > hello spark-world, > > How to use Spark-Scala to download a CSV

How to use Spark-Scala to download a CSV file from the web?

2016-09-25 Thread Dan Bikle
Hello spark-world, how can I use Spark-Scala to download a CSV file from the web and load the file into a spark-csv DataFrame? Currently I depend on curl in a shell command to get my CSV file. Here is the syntax I want to enhance: */* fb_csv.scala This script should load FB
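
A minimal sketch of one way to drop the external curl step, assuming the file fits in driver memory (spark-csv reads Hadoop-visible filesystems, not http:// URLs directly; the URL and paths are illustrative):

    import java.io.PrintWriter
    import scala.io.Source

    // Fetch on the driver, stage to a local file, then load with spark-csv
    val csvText = Source.fromURL("https://example.com/fb.csv").mkString
    new PrintWriter("/tmp/fb.csv") { write(csvText); close() }

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("file:///tmp/fb.csv")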

Re: spark-submit failing but job running from scala ide

2016-09-25 Thread Jacek Laskowski
Hi, Can you execute run-example SparkPi with your Spark installation? Also, see the logs: 16/09/24 23:15:15 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041. 16/09/24 23:15:15 INFO Utils: Successfully started service 'SparkUI' on port 4041. You've got two Spark

In Spark-Scala, how to copy Array of Lists into new DataFrame?

2016-09-25 Thread Dan Bikle
Hello World, I am familiar with Python and I am learning Spark-Scala. I want to build a DataFrame with the structure described by this syntax: *// Prepare training data from a list of (label, features) tuples. val training = spark.createDataFrame(Seq( (1.1, Vectors.dense(1.1, 0.1)),
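
A minimal sketch of that pattern filled out (Spark 2.0; the values are illustrative): createDataFrame accepts a Seq of tuples directly, and toDF names the columns:

    import org.apache.spark.ml.linalg.Vectors

    // Prepare training data from a list of (label, features) tuples
    val training = spark.createDataFrame(Seq(
      (1.1, Vectors.dense(1.1, 0.1)),
      (0.0, Vectors.dense(1.0, -1.0)),
      (1.0, Vectors.dense(2.0, 1.3))
    )).toDF("label", "features")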

Re: udf forces usage of Row for complex types?

2016-09-25 Thread Bedrytski Aliaksandr
Hi Koert, these case classes you are talking about should be serializable to be efficient (like kryo or just plain Java serialization). A DataFrame is not simply a collection of Rows (which are serializable by default); it also contains a schema with a different type for each column. This way any

Re: Open source Spark based projects

2016-09-25 Thread Simon Chan
PredictionIO is an open-source machine learning server project based on Spark - http://predictionio.incubator.apache.org/ Simon On Fri, Sep 23, 2016 at 12:46 PM, manasdebashiskar wrote: > check out spark packages https://spark-packages.org/ and you will find few >

spark-submit failing but job running from scala ide

2016-09-25 Thread vr spark
Hi, I have a simple Scala app which works fine when I run it as a Scala application from the Scala IDE for Eclipse. But when I export it as a jar and run it from spark-submit I get the error below. Please suggest. *bin/spark-submit --class com.x.y.vr.spark.first.SimpleApp test.jar* 16/09/24

Re: In Spark-Scala, how to copy Array of Lists into new DataFrame?

2016-09-25 Thread Marco Mistroni
Hi, I must admit I had issues as well finding a sample that does that (hopefully the Spark folks can add more examples, or someone on the list can post sample code?). Hopefully you can reuse the sample below. So, you start from an RDD of doubles (myRdd). ## make a row val toRddOfRows =
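
A minimal sketch completing the approach outlined above: map the RDD of doubles to Rows, attach a schema, and build the DataFrame (the pre-2.0 sqlContext API, matching the guide linked in the follow-up):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

    val myRdd = sc.parallelize(Seq(1.1, 2.2, 3.3))
    // make a row per double
    val toRddOfRows = myRdd.map(d => Row(d))
    val schema = StructType(Seq(StructField("value", DoubleType, nullable = false)))
    val df = sqlContext.createDataFrame(toRddOfRows, schema)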

Re: spark-submit failing but job running from scala ide

2016-09-25 Thread vr spark
Yes, I have both Spark 1.6 and Spark 2.0. I unset the SPARK_HOME environment variable and pointed spark-submit to 2.0. It's working now. How do I uninstall/remove Spark 1.6 from my Mac? Thanks On Sun, Sep 25, 2016 at 4:28 AM, Jacek Laskowski wrote: > Hi, > > Can you execute

Re: In Spark-Scala, how to copy Array of Lists into new DataFrame?

2016-09-25 Thread Marco Mistroni
Hi, in fact I have just found some written notes in my code. See if this doc helps you (it will work with any Spark version, not only 1.3.0): https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#creating-dataframes hth On Sun, Sep 25, 2016 at 1:25 PM, Marco Mistroni

Re: spark-submit failing but job running from scala ide

2016-09-25 Thread Jacek Laskowski
Hi, how did you install Spark 1.6? It's usually as simple as rm -rf $SPARK_1.6_HOME, but it really depends on how you installed it in the first place. Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at