Spark Streaming: Combine MLlib Prediction and Features on DStreams

2016-05-26 Thread obaidul karim
Hi Guys, This is my first mail to the Spark users mailing list. I need help with a DStream operation. In fact, I am using an MLlib random forest model to predict using Spark Streaming. In the end, I want to combine the feature DStream & prediction DStream together for further downstream processing. I am

Sample scala program using CMU Sphinx4

2016-05-26 Thread Vajra L
Folks, Does anyone have a sample program for Speech recognition in CMU Sphinx4 on Spark Scala? Thanks. Vajra

Re: unsure how to create 2 outputs from spark-sql udf expression

2016-05-26 Thread Koert Kuipers
yeah that could work, since i should know (or be able to find out) all the input columns On Thu, May 26, 2016 at 11:30 PM, Takeshi Yamamuro wrote: > You couldn't do like this? > > -- > val func = udf((i: Int) => Tuple2(i, i)) > val df = Seq((1, ..., 0), (2, ...,
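
A minimal sketch of that approach, assuming Spark 1.6-era DataFrames and placeholder column names: the UDF returns a Tuple2 (a struct column), and selecting col("*") alongside the two struct fields keeps every existing column while adding the two new outputs.

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, udf}

// assumes sc is an existing SparkContext
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// a UDF that returns two values as a single struct column
val func = udf((i: Int) => (i + 1, i * 2))

val df = Seq((1, 0), (2, 5)).toDF("a", "b")

// keep all input columns and add the two new outputs
val result = df
  .withColumn("_tmp", func(col("a")))
  .select(col("*"), col("_tmp._1").as("out1"), col("_tmp._2").as("out2"))
  .drop("_tmp")

result.show()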

Re: Spark input size when filtering on parquet files

2016-05-26 Thread Takeshi Yamamuro
Hi, Spark just prints #bytes in the web UI, accumulated from InputSplit#getLength (i.e., just the length of the files). Therefore, I'm afraid this metric does not reflect the actual bytes read for Parquet. If you want that metric, you need to use other tools such as iostat or something. // maropu

Re: Spark Job Execution halts during shuffle...

2016-05-26 Thread Ted Yu
Priya: Have you checked the executor logs on hostname1 and hostname2 ? Cheers On Thu, May 26, 2016 at 8:00 PM, Takeshi Yamamuro wrote: > Hi, > > If you get stuck in job fails, one of best practices is to increase > #partitions. > Also, you'd better off using DataFrame

Re: unsure how to create 2 outputs from spark-sql udf expression

2016-05-26 Thread Koert Kuipers
yes, but i also need all the columns (plus of course the 2 new ones) in my output. your select operation drops all the input columns. best, koert On Thu, May 26, 2016 at 11:02 PM, Takeshi Yamamuro wrote: > Couldn't you include all the needed columns in your input

Re: unsure how to create 2 outputs from spark-sql udf expression

2016-05-26 Thread Takeshi Yamamuro
Couldn't you include all the needed columns in your input dataframe? // maropu On Fri, May 27, 2016 at 1:46 AM, Koert Kuipers wrote: > that is nice and compact, but it does not add the columns to an existing > dataframe > > On Wed, May 25, 2016 at 11:39 PM, Takeshi Yamamuro

Re: Spark Job Execution halts during shuffle...

2016-05-26 Thread Takeshi Yamamuro
Hi, If you get stuck with job failures, one of the best practices is to increase #partitions. Also, you'd be better off using DataFrame instead of RDD in terms of join optimization. // maropu On Thu, May 26, 2016 at 11:40 PM, Priya Ch wrote: > Hello Team, > > > I am
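
A rough sketch of both suggestions, with hypothetical key/value data and a placeholder partition count:

import org.apache.spark.sql.SQLContext

// assumes sc is an existing SparkContext
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val bigRdd   = sc.parallelize(Seq((1, "a"), (2, "b")))
val smallRdd = sc.parallelize(Seq((1, "x"), (2, "y")))

// 1) increase the number of partitions before the shuffle-heavy join
val joinedRdd = bigRdd.repartition(200).join(smallRdd.repartition(200))

// 2) or convert to DataFrames so the join goes through the optimizer
val bigDF    = bigRdd.toDF("key", "big_value")
val smallDF  = smallRdd.toDF("key", "small_value")
val joinedDF = bigDF.join(smallDF, "key")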

RE: Not able to write output to local filesystem from Standalone mode.

2016-05-26 Thread Yong Zhang
That just makes sense, doesn't it? The only place it will happen is the driver. Otherwise, the executors would contend over which one should create the directory in this case. The coordinator (the driver in this case) is the best place for doing it. Yong From: math...@closetwork.org Date: Wed, 25 May 2016

Re: Pros and Cons

2016-05-26 Thread Koert Kuipers
We do disk-to-disk iterative algorithms in spark all the time, on datasets that do not fit in memory, and it works well for us. I usually have to do some tuning of number of partitions for a new dataset but that's about it in terms of inconveniences. On May 26, 2016 2:07 AM, "Jörn Franke"

Re: Insert into JDBC

2016-05-26 Thread Andrés Ivaldi
Done, version 1.6.1 has the fix; I updated and it works fine. Thanks. On Thu, May 26, 2016 at 4:15 PM, Anthony May wrote: > It's on the 1.6 branch > > On Thu, May 26, 2016 at 4:43 PM Andrés Ivaldi wrote: > >> I see, I'm using Spark 1.6.0 and that change

Re: Insert into JDBC

2016-05-26 Thread Anthony May
It's on the 1.6 branch On Thu, May 26, 2016 at 4:43 PM Andrés Ivaldi wrote: > I see, I'm using Spark 1.6.0 and that change seems to be for 2.0 or maybe > it's in 1.6.1 looking at the history. > thanks I'll see if update spark to 1.6.1 > > On Thu, May 26, 2016 at 3:33 PM,

Re: Insert into JDBC

2016-05-26 Thread Andrés Ivaldi
I see, I'm using Spark 1.6.0 and that change seems to be for 2.0, or maybe it's in 1.6.1 looking at the history. Thanks, I'll see about updating Spark to 1.6.1. On Thu, May 26, 2016 at 3:33 PM, Anthony May wrote: > It doesn't appear to be configurable, but it is inserting by

Re: Insert into JDBC

2016-05-26 Thread Anthony May
It doesn't appear to be configurable, but it is inserting by column name: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L102 On Thu, 26 May 2016 at 16:02 Andrés Ivaldi wrote: > Hello, >

Re: JDBC Dialect for saving DataFrame into Vertica Table

2016-05-26 Thread Reynold Xin
It's probably a good idea to have the vertica dialect too, since it doesn't seem like it'd be too difficult to maintain. It is not going to be as performant as the native Vertica data source, but is going to be much lighter weight. On Thu, May 26, 2016 at 3:09 PM, Mohammed Guller
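
For reference, a sketch of roughly what a custom dialect looks like with Spark's JdbcDialect API; the type mappings below are illustrative assumptions, not the actual Vertica dialect:

import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

// minimal sketch of a custom dialect; the mappings are assumptions
object VerticaDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:vertica")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("VARCHAR(65000)", Types.VARCHAR))
    case BooleanType => Some(JdbcType("BOOLEAN", Types.BOOLEAN))
    case _           => None  // fall back to the default mappings
  }
}

// register it before calling dataframe.write.jdbc(...)
JdbcDialects.registerDialect(VerticaDialect)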

RE: JDBC Dialect for saving DataFrame into Vertica Table

2016-05-26 Thread Mohammed Guller
Vertica also provides a Spark connector. It was not GA the last time I looked at it, but available on the Vertica community site. Have you tried using the Vertica Spark connector instead of the JDBC driver? Mohammed Author: Big Data Analytics with

Insert into JDBC

2016-05-26 Thread Andrés Ivaldi
Hello, I realize that when a DataFrame executes an insert it inserts by schema column order instead of by name, i.e. dataframe.write(SaveMode).jdbc(url, table, properties). Reading the profiler, the execution is insert into TableName values(a,b,c..); what I need is insert into TableNames
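
Since the generated statement is positional (insert into TableName values(...)), one workaround is to select the DataFrame columns in the target table's column order before writing. A sketch with a hypothetical table layout:

import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

// hypothetical target table layout: (id, name, amount)
val tableColumnOrder = Seq("id", "name", "amount")

def writeInTableOrder(df: DataFrame, url: String, table: String): Unit = {
  val props = new Properties()
  // reorder the DataFrame columns to match the table, since the
  // generated INSERT binds values by position rather than by name
  val ordered = df.select(tableColumnOrder.head, tableColumnOrder.tail: _*)
  ordered.write.mode(SaveMode.Append).jdbc(url, table, props)
}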

Re: HiveContext standalone => without a Hive metastore

2016-05-26 Thread Michael Armbrust
You can also just make sure that each user is using their own directory. A rough example can be found in TestHive. Note: in Spark 2.0 there should be no need to use HiveContext unless you need to talk to a metastore. On Thu, May 26, 2016 at 1:36 PM, Mich Talebzadeh

Spark input size when filtering on parquet files

2016-05-26 Thread Dennis Hunziker
Hi all I was looking into Spark 1.6.1 (Parquet 1.7.0, Hive 1.2.1) in order to find out about the improvements made in filtering/scanning parquet files when querying for tables using SparkSQL and how these changes relate to the new filter API introduced in Parquet 1.7.0. After checking the

Re: How to set the degree of parallelism in Spark SQL?

2016-05-26 Thread Mich Talebzadeh
Also worth adding that in standalone mode there is only one executor per spark-submit job. In Standalone cluster mode Spark allocates resources based on cores. By default, an application will grab all the cores in the cluster. You only have one worker that lives within the driver JVM process
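
For example, a sketch of capping what a standalone application grabs, with placeholder values:

import org.apache.spark.{SparkConf, SparkContext}

// placeholder numbers; tune per cluster
val conf = new SparkConf()
  .setAppName("example")
  .set("spark.cores.max", "8")         // cap the total cores this app takes on a standalone cluster
  .set("spark.executor.memory", "4g")  // memory per executor

val sc = new SparkContext(conf)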

Re: Kafka connection logs in Spark

2016-05-26 Thread Cody Koeninger
Sounds like you'd better talk to Hortonworks then. On Thu, May 26, 2016 at 2:33 PM, Mail.com wrote: > Hi Cody, > > I used Hortonworks jars for spark streaming that would enable get messages > from Kafka with kerberos. > > Thanks, > Pradeep > > >> On May 26, 2016, at

Re: Kafka connection logs in Spark

2016-05-26 Thread Mail.com
Hi Cody, I used Hortonworks jars for spark streaming that enable getting messages from Kafka with Kerberos. Thanks, Pradeep > On May 26, 2016, at 11:04 AM, Cody Koeninger wrote: > > I wouldn't expect kerberos to work with anything earlier than the beta > consumer for

Re: Problem instantiation of HiveContext

2016-05-26 Thread Ian
The exception indicates that Spark cannot invoke the method it's trying to call, which is probably caused by a library missing. Do you have a Hive configuration (hive-site.xml) or similar in your $SPARK_HOME/conf folder? -- View this message in context:

Re: HiveContext standalone => without a Hive metastore

2016-05-26 Thread Gerard Maas
Thanks a lot for the advice! I found out why the standalone HiveContext would not work: it was trying to deploy a Derby db and the user had no rights to create the dir where the db is stored: Caused by: java.sql.SQLException: Failed to create database 'metastore_db', see the next exception
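
One way around that, a sketch assuming a per-user writable directory, is to point the embedded Derby metastore and the warehouse somewhere the user can write before the HiveContext first touches the metastore:

import org.apache.spark.sql.hive.HiveContext

// assumes sc is an existing SparkContext; the path is a placeholder
val userDir = s"/tmp/spark-${sys.props("user.name")}"

// Derby resolves the relative 'metastore_db' name against derby.system.home,
// so point it at a user-writable directory before the metastore is created
System.setProperty("derby.system.home", userDir)

val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.metastore.warehouse.dir", s"$userDir/warehouse")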

Re: Stackoverflowerror in scala.collection

2016-05-26 Thread Jeff Jones
I’ve seen this when I specified “too many” where clauses in the SQL query. I was able to adjust my query to use a single ‘in’ clause rather than many ‘=’ clauses but I realize that may not be an option in all cases. Jeff On 5/4/16, 2:04 PM, "BenD" wrote: >I am
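
A toy illustration of that rewrite, with a hypothetical table and column:

// many equality predicates chained with OR can blow up the parsed plan:
val manyEquals =
  "SELECT * FROM events WHERE id = 1 OR id = 2 OR id = 3 /* ...hundreds more... */"

// a single IN clause expresses the same filter far more compactly:
val singleIn =
  "SELECT * FROM events WHERE id IN (1, 2, 3 /* ...hundreds more... */)"

// assumes sqlContext is an existing SQLContext and 'events' is a registered temp table
// val df = sqlContext.sql(singleIn)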

Re: save RDD of Avro GenericRecord as parquet throws UnsupportedOperationException

2016-05-26 Thread Ian
Have you tried saveAsNewAPIHadoopFile? See: http://stackoverflow.com/questions/29238126/how-to-save-a-spark-rdd-to-an-avro-file -- View this message in context:

Re: List of questios about spark

2016-05-26 Thread Ian
I'll attempt to answer a few of your questions: There are no limitations with regard to the number of dimension or lookup tables for Spark. As long as you have disk space, you should have no problem. Obviously, if you do joins among dozens or hundreds of tables it may take a while since it's

Re: How to set the degree of parallelism in Spark SQL?

2016-05-26 Thread Ian
The number of executors is set when you launch the shell or an application with spark-submit. It's controlled by the num-executors parameter: https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-technologies/. It's also important that cranking up the number may not
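
For reference, a sketch of the equivalent settings from code when running on YARN; the numbers are placeholders, not recommendations:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("example")
  .set("spark.executor.instances", "10")  // equivalent of --num-executors
  .set("spark.executor.cores", "4")       // equivalent of --executor-cores
  .set("spark.executor.memory", "8g")     // equivalent of --executor-memory

val sc = new SparkContext(conf)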

Re: Subtract two DataFrames is not working

2016-05-26 Thread Ted Yu
Can you be a bit more specific about how they didn't work ? BTW 1.4.1 seems to be an old release. Please try 1.6.1 if possible. Cheers On Thu, May 26, 2016 at 9:44 AM, Gurusamy Thirupathy wrote: > I have to subtract two dataframes, I tried with except method but it's not

Re: unsure how to create 2 outputs from spark-sql udf expression

2016-05-26 Thread Koert Kuipers
that is nice and compact, but it does not add the columns to an existing dataframe On Wed, May 25, 2016 at 11:39 PM, Takeshi Yamamuro wrote: > Hi, > > How about this? > -- > val func = udf((i: Int) => Tuple2(i, i)) > val df = Seq((1, 0), (2, 5)).toDF("a", "b") >

Subtract two DataFrames is not working

2016-05-26 Thread Gurusamy Thirupathy
I have to subtract two DataFrames. I tried the except method but it's not working. I also tried drop. I am using Spark 1.4.1 and Scala 2.10. Can you please help? Thanks, Guru
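
For reference, a minimal sketch of the except-based subtraction (whether it misbehaves on this particular data in 1.4.1 is exactly the open question):

import org.apache.spark.sql.SQLContext

// assumes sc is an existing SparkContext
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
val df2 = Seq((2, "b")).toDF("id", "value")

// rows of df1 that do not appear in df2 (set difference)
val diff = df1.except(df2)
diff.show()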

Distributed matrices with column counts represented by Int (rather than Long)

2016-05-26 Thread Phillip Henry
Hi, I notice that some DistributedMatrix represent the number of columns with an Int rather than a Long (RowMatrix etc). This limits the number of columns to about 2 billion. We're approaching that limit. What do people recommend we do to mitigate the problem? Are there plans to use a larger

Re: save RDD of Avro GenericRecord as parquet throws UnsupportedOperationException

2016-05-26 Thread Ted Yu
Have you seen this thread ? http://search-hadoop.com/m/q3RTtWmyYB5fweR=Re+Best+way+to+store+Avro+Objects+as+Parquet+using+SPARK On Thu, May 26, 2016 at 6:55 AM, Govindasamy, Nagarajan < ngovindas...@turbine.com> wrote: > Hi, > > I am trying to save RDD of Avro GenericRecord as parquet. I am

Re: about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-26 Thread Alonso Isidoro Roman
Thank you Cody, I will try to follow your advice. Alonso Isidoro Roman about.me/alonso.isidoro.roman 2016-05-26 17:00 GMT+02:00 Cody Koeninger :

Re: about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-26 Thread Cody Koeninger
Honestly, given this thread and the Stack Overflow thread, I'd say you need to back up, start very simply, and learn Spark. If for some reason the official docs aren't doing it for you, Learning Spark from O'Reilly is a good book. Given your specific question, why not just messages.foreachRDD {
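
A bare-bones sketch of that pattern for the direct Kafka stream, with placeholder broker and topic names:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// placeholder settings
val conf = new SparkConf().setAppName("kafka-direct-example")
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics = Set("mytopic")

val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

// do the per-batch work directly inside foreachRDD
messages.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    partition.foreach { case (_, value) =>
      println(value)  // process each record here
    }
  }
}

ssc.start()
ssc.awaitTermination()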

JDBC Dialect for saving DataFrame into Vertica Table

2016-05-26 Thread Aaron Ilovici
I am attempting to write a DataFrame of Rows to Vertica via DataFrameWriter's jdbc function in the following manner: dataframe.write().mode(SaveMode.Append).jdbc(url, table, properties); This works when there are no NULL values in any of the Rows in my DataFrame. However, when there are rows,

Re: Kafka connection logs in Spark

2016-05-26 Thread Cody Koeninger
I wouldn't expect kerberos to work with anything earlier than the beta consumer for kafka 0.10 On Wed, May 25, 2016 at 9:41 PM, Mail.com wrote: > Hi All, > > I am connecting Spark 1.6 streaming to Kafka 0.8.2 with Kerberos. I ran > spark streaming in debug mode, but do

Spark Job Execution halts during shuffle...

2016-05-26 Thread Priya Ch
Hello Team, I am trying to join 2 RDDs where one is of size 800 MB and the other is 190 MB. During the join step, my job halts and I don't see progress in the execution. This is the message I see on the console - INFO spark.MapOutputTrackerMasterEndPoint: Asked to send map output

save RDD of Avro GenericRecord as parquet throws UnsupportedOperationException

2016-05-26 Thread Govindasamy, Nagarajan
Hi, I am trying to save RDD of Avro GenericRecord as parquet. I am using Spark 1.6.1. DStreamOfAvroGenericRecord.foreachRDD(rdd => rdd.toDF().write.parquet("s3://bucket/data.parquet")) Getting the following exception. Is there a way to save Avro GenericRecord as Parquet or ORC file?
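
One alternative that avoids the DataFrame conversion is writing the pair RDD through AvroParquetOutputFormat with saveAsNewAPIHadoopFile. A sketch, assuming Parquet 1.7.x package names and that the records' Avro schema is available as 'schema':

import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.avro.AvroParquetOutputFormat
import org.apache.spark.rdd.RDD

// a sketch; 'schema' and 'outputPath' are assumptions supplied by the caller
def saveAsParquet(rdd: RDD[GenericRecord], schema: Schema, outputPath: String): Unit = {
  val job = Job.getInstance(rdd.sparkContext.hadoopConfiguration)
  AvroParquetOutputFormat.setSchema(job, schema)

  rdd
    .map(record => (null, record))  // saveAsNewAPIHadoopFile needs a pair RDD
    .saveAsNewAPIHadoopFile(
      outputPath,
      classOf[Void],
      classOf[GenericRecord],
      classOf[AvroParquetOutputFormat],
      job.getConfiguration)
}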

Apache Spark Video Processing from NFS Shared storage: Advise needed

2016-05-26 Thread mobcdi
Hi all, Is it advisable to use nfs as shared storage for a small Spark cluster to process video and images? I have a total of 20 vms (2vCPU, 6GB Ram, 20GB Local Disk) connected to 500GB nfs shared storage (mounted the same in each of the vms) at my disposal and I'm wondering if I can avoid the

Re: HiveContext standalone => without a Hive metastore

2016-05-26 Thread Mich Talebzadeh
Using HiveContext, which is basically an SQL API within Spark, without a proper Hive setup does not make sense. It is a superset of Spark's SQLContext. In addition, simple things like registerTempTable may not work. HTH Dr Mich Talebzadeh LinkedIn *

System.exit in local mode ?

2016-05-26 Thread yael aharon
Hello, I have noticed that in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/SparkUncaughtExceptionHandler.scala spark would call System.exit if an uncaught exception was encountered. I have an application that is running spark in local mode, and would like

Re: Error while saving plots

2016-05-26 Thread Sonal Goyal
Does the path /home/njoshi/dev/outputs/test_/plots/ exist on the driver ? Best Regards, Sonal Founder, Nube Technologies Reifier at Strata Hadoop World Reifier at Spark Summit 2015

Re: HiveContext standalone => without a Hive metastore

2016-05-26 Thread Silvio Fiorito
Hi Gerard, I’ve never had an issue using the HiveContext without a hive-site.xml configured. However, one issue you may have is if multiple users are starting the HiveContext from the same path, they’ll all be trying to store the default Derby metastore in the same location. Also, if you want

Re: HiveContext standalone => without a Hive metastore

2016-05-26 Thread Mich Talebzadeh
Hi Gerald, I am not sure that so-called independence will work. I gather you want to use HiveContext for your SQL queries, and sqlContext only provides a subset of HiveContext. Try this: val sc = new SparkContext(conf) // Create sqlContext based on HiveContext val sqlContext = new
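
A minimal sketch of that setup plus a window function, assuming sc is an existing SparkContext:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.hive.HiveContext

// in Spark 1.x, window functions on DataFrames need a HiveContext
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

val df = Seq(("a", 1), ("a", 3), ("b", 2)).toDF("key", "value")

// example window function: number rows within each key, ordered by value
val w = Window.partitionBy("key").orderBy("value")
df.withColumn("rn", row_number().over(w)).show()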

Re: How spark depends on Guava

2016-05-26 Thread Steve Loughran
On 23 May 2016, at 06:32, Todd wrote: Can someone please take a look at my question? I am using spark-shell in local mode and yarn-client mode. Spark code uses the Guava library, so Spark should have Guava in place during run time. Thanks. At 2016-05-23 11:48:58,

Re: spark on yarn

2016-05-26 Thread Steve Loughran
> On 21 May 2016, at 15:14, Shushant Arora wrote: > > And will it allocate rest executors when other containers get freed which > were occupied by other hadoop jobs/spark applications? > requests will go into the queue(s), they'll stay outstanding until things free

Re: How to run large Hive queries in PySpark 1.2.1

2016-05-26 Thread Nikolay Voronchikhin
Hi Jörn, We will be upgrading to MapR 5.1, Hive 1.2, and Spark 1.6.1 at the end of June. In the meantime, can this still be done with these versions? There is no firewall issue since we have edge nodes and cluster nodes hosted in the same location with the same NFS mount. On Thu, May 26,

HiveContext standalone => without a Hive metastore

2016-05-26 Thread Gerard Maas
Hi, I'm helping some folks set up an analytics cluster with Spark. They want to use the HiveContext to enable the Window functions on DataFrames(*), but they don't have any Hive installation, nor do they need one at the moment (if not necessary for this feature). When we try to create a Hive

Does decimal(6,-2) exists on purpose?

2016-05-26 Thread Ofir Manor
Hi, I was surprised to notice a negative scale on decimal (Spark 1.6.1). To reproduce: scala> z.printSchema root |-- price: decimal(6,2) (nullable = true) scala> val a = z.selectExpr("round(price,-2)") a: org.apache.spark.sql.DataFrame = [round(price,-2): decimal(6,-2)] I expected the function
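
If the negative scale is a problem downstream, one workaround (a sketch reusing the z DataFrame from the example above) is to cast the rounded value back to an explicit decimal type:

// cast away the decimal(6,-2) result type produced by round(price, -2)
val rounded = z.selectExpr("cast(round(price, -2) as decimal(6,0)) as price_rounded")
rounded.printSchema()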

How to run large Hive queries in PySpark 1.2.1

2016-05-26 Thread Nikolay Voronchikhin
Hi PySpark users, We need to be able to run large Hive queries in PySpark 1.2.1. Users are running PySpark on an Edge Node, and submit jobs to a Cluster that allocates YARN resources to the clients. We are using MapR as the Hadoop Distribution on top of Hive 0.13 and Spark 1.2.1. Currently, our

Re: Facing issues while reading parquet file in spark 1.2.1

2016-05-26 Thread vaibhav srivastava
Any suggestions? On 25 May 2016 17:25, "vaibhav srivastava" wrote: > Hi, > I am using Spark 1.2.1. I am trying to read a parquet file using > sqlContext.parquetFile("path to file"). The parquet file is using > ParquetHiveSerDe and the input format is

Re: about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-26 Thread Alonso Isidoro Roman
Hi Matthias and Cody, You can see in the code that StreamingContext.start() is called after the messages.foreachRDD output action. Another problem, @Cody, is how can I avoid the inner .foreachRDD(_.foreachPartition(it => recommender.predictWithALS(it.toSeq))) in order to invoke asynchronously

Re: Pros and Cons

2016-05-26 Thread Jörn Franke
Spark can handle this, true, but it is optimized for the idea that it works on the same full dataset in-memory, due to the underlying iterative nature of machine learning algorithms. Of course, you can spill over, but you should avoid that. That being said, you should have read my