Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-13 Thread Bijay Pathak
Hi Swetha, One option is to use a Hive release with the above issues fixed, which is Hive 2.0 or Cloudera CDH Hive 1.2. One thing to remember is that it's not the Hive you have installed that matters, but the Hive Spark is using, which in Spark 1.6 is Hive 1.2 as of now. The workaround I
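A minimal sketch of the dynamic-partition route being discussed, assuming Spark 1.6 with a HiveContext; the target table name user_records is illustrative, while the partition columns and the userRecordsTemp temp table come from the snippet later in this thread. This is not necessarily the exact workaround Bijay goes on to describe.

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// Allow Hive to create the partition directories on the fly.
hiveContext.sql("SET hive.exec.dynamic.partition=true")
hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
hiveContext.sql("SET hive.exec.max.dynamic.partitions=5000")
// Partition columns go last in the SELECT, matching the PARTITION clause.
hiveContext.sql(
  """INSERT OVERWRITE TABLE user_records
    |PARTITION (idPartitioner, dtPartitioner)
    |SELECT userId, userRecord, idPartitioner, dtPartitioner
    |FROM userRecordsTemp""".stripMargin)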

Re: OutOfMemoryError - When saving Word2Vec

2016-06-13 Thread Yuhao Yang
Hi Sharad, what's your vocabulary size and vector length for Word2Vec? Regards, Yuhao 2016-06-13 20:04 GMT+08:00 sharad82 : > Is this the right forum to post Spark related issues ? I have tried this > forum along with StackOverflow but not seeing any response. > >
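For reference, a rough sketch of the two knobs being asked about, using the MLlib Word2Vec API; the fitted model holds on the order of vocabularySize x vectorSize floats, so either shrinking the vectors or raising minCount (which prunes rare words from the vocabulary) reduces the memory needed when saving. The corpus RDD and output path are illustrative.

import org.apache.spark.mllib.feature.Word2Vec

val word2vec = new Word2Vec()
  .setVectorSize(100)   // smaller vectors -> smaller model
  .setMinCount(10)      // drop rare words -> smaller vocabulary
// tokenizedCorpus: RDD[Seq[String]] of pre-tokenized sentences (assumed)
val model = word2vec.fit(tokenizedCorpus)
model.save(sc, "hdfs:///models/word2vec")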

Re: Is there a limit on the number of tasks in one job?

2016-06-13 Thread Takeshi Yamamuro
Hi, You can control the initial number of partitions (tasks) in v2.0. https://www.mail-archive.com/user@spark.apache.org/msg51603.html // maropu On Tue, Jun 14, 2016 at 7:24 AM, Mich Talebzadeh wrote: > Have you looked at spark GUI to see what it is waiting for. is
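A small sketch of what controlling the initial partition count looks like in the 2.0 preview, assuming the file-source settings below (larger values mean fewer, bigger read tasks); the path is the one from the original question.

spark.conf.set("spark.sql.files.maxPartitionBytes", 256 * 1024 * 1024) // bytes per input split
spark.conf.set("spark.sql.files.openCostInBytes", 8 * 1024 * 1024)     // padding cost per file
val df = spark.read.json("hdfs:///user/hadoop/data/*/*")
println(df.rdd.getNumPartitions)   // should drop as the values above grow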

Re: Suggestions on Lambda Architecture in Spark

2016-06-13 Thread Luciano Resende
This might be useful: https://spark-summit.org/east-2016/events/lambda-at-weather-scale/ On Monday, June 13, 2016, KhajaAsmath Mohammed wrote: > Hi, > > In my current project, we are planning to implement POC for lambda > architecture using spark streaming. My use case

Suggestions on Lambda Architecture in Spark

2016-06-13 Thread KhajaAsmath Mohammed
Hi, In my current project, we are planning to implement a POC for lambda architecture using Spark Streaming. My use case would be Kafka --> batch layer --> Spark SQL --> Cassandra; Kafka --> speed layer --> Spark Streaming --> Cassandra; serving layer --> contact both the layers, but I am not sure
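A hedged sketch of the speed-layer leg only (Kafka --> Spark Streaming --> Cassandra), assuming Spark 1.6 with the spark-streaming-kafka and DataStax spark-cassandra-connector artifacts on the classpath; broker, topic, keyspace, table, and column names are all illustrative.

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._

val ssc = new StreamingContext(sc, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))

stream.map { case (_, value) => (value, 1L) }          // toy transformation
      .saveToCassandra("lambda_ks", "speed_view", SomeColumns("event", "count"))

ssc.start()
ssc.awaitTermination()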

Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-13 Thread Arun Patel
Thanks Michael. I went through these slides already and could not find answers to these specific questions. I created a Dataset and converted it to a DataFrame in 1.6 and 2.0. I don't see any difference between 1.6 and 2.0, so I really got confused and asked these questions about unification.

Re: Is there a limit on the number of tasks in one job?

2016-06-13 Thread Mich Talebzadeh
Have you looked at the Spark GUI to see what it is waiting for? Is it available memory? What is the resource manager you are using? Dr Mich Talebzadeh

Spark Memory Error - Not enough space to cache broadcast

2016-06-13 Thread Cassa L
Hi, I'm using Spark version 1.5.1. I am reading data from Kafka into Spark and writing it into Cassandra after processing it. The Spark job starts fine and runs well for some time until I start getting the errors below. Once these errors appear, the job starts to lag behind and I see that the job has scheduling
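For context, a hedged sketch of the settings usually adjusted first when broadcast blocks no longer fit in storage memory on Spark 1.5.x (which still uses the legacy memory manager); the values are illustrative starting points, not recommendations.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "6g")              // more heap per executor
  .set("spark.storage.memoryFraction", "0.5")      // legacy storage fraction in 1.5.x
  .set("spark.shuffle.memoryFraction", "0.3")      // leave room for shuffle as well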

how to investigate skew and DataFrames and RangePartitioner

2016-06-13 Thread Peter Halliday
I have two questions. First, I have a failure when I write parquet from Spark 1.6.1 on Amazon EMR to S3. This is a full batch, which is over 200GB of source data. The partitioning is based on a geographic identifier we use, and also the date we got the data. However, because of geographical
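One quick way to investigate the skew part of the question: count rows per partition of the DataFrame just before the parquet write; a handful of huge partitions usually explains a failing or stalling write. dataFrame stands in for whatever is being saved.

val partitionSizes = dataFrame.rdd
  .mapPartitionsWithIndex { (idx, rows) => Iterator((idx, rows.size)) }
  .collect()

partitionSizes.sortBy(-_._2).take(20).foreach { case (idx, n) =>
  println(s"partition $idx -> $n rows")   // the biggest partitions show up first
}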

Re: LegacyAccumulatorWrapper basically requires the Accumulator value to implement equals() or it will fail on isZero()

2016-06-13 Thread Amit Sela
I thought so, and I agree. Still good to have this indexed here :) On Mon, Jun 13, 2016 at 10:43 PM Sean Owen wrote: > I think that's right, but that seems as expected. If you're going to > use this utility wrapper class, it can only determine if something is > zero by

Re: Is there a limit on the number of tasks in one job?

2016-06-13 Thread Khaled Hammouda
Hi Michael, Thanks for the suggestion to use Spark 2.0 preview. I just downloaded the preview and tried using it, but I’m running into the exact same issue. Khaled > On Jun 13, 2016, at 2:58 PM, Michael Armbrust wrote: > > You might try with the Spark 2.0 preview. We

Re: LegacyAccumulatorWrapper basically requires the Accumulator value to implement equals() or it will fail on isZero()

2016-06-13 Thread Sean Owen
I think that's right, but that seems as expected. If you're going to use this utility wrapper class, it can only determine if something is zero by comparing it to your 'zero' object, and that means defining equality. I suspect it's uncommon to accumulate things that aren't primitives or standard

LegacyAccumulatorWrapper basically requires the Accumulator value to implement equals() or it will fail on isZero()

2016-06-13 Thread Amit Sela
It seems that if you have an AccumulatorParam (or AccumulableParam) where "R" is not a primitive, it will need to implement equals() if the implementation of zero() creates a new instance (which I guess it will in those cases). This is where isZero applies the comparison:
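A small sketch of the situation being described: "R" is a non-primitive, zero() returns a fresh instance, and the isZero comparison only works because a case class gets a structural equals() for free. With a plain class (reference equality) the same accumulator would trip the check. Stats and its fields are illustrative.

import org.apache.spark.AccumulatorParam

case class Stats(count: Long, sum: Double)   // case class => equals()/hashCode() generated

object StatsParam extends AccumulatorParam[Stats] {
  def zero(initial: Stats): Stats = Stats(0L, 0.0)   // new instance on every call
  def addInPlace(a: Stats, b: Stats): Stats = Stats(a.count + b.count, a.sum + b.sum)
}

val acc = sc.accumulator(Stats(0L, 0.0))(StatsParam)
sc.parallelize(1 to 100).foreach(i => acc += Stats(1L, i.toDouble))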

Re: Is there a limit on the number of tasks in one job?

2016-06-13 Thread Michael Armbrust
You might try with the Spark 2.0 preview. We spent a bunch of time improving the handling of many small files. On Mon, Jun 13, 2016 at 11:19 AM, khaled.hammouda wrote: > I'm trying to use Spark SQL to load json data that are split across about > 70k > files across 24

Re: Spark Thrift Server in CDH 5.3

2016-06-13 Thread Michael Armbrust
I'd try asking on the cloudera forums. On Sun, Jun 12, 2016 at 9:51 PM, pooja mehta wrote: > Hi, > > How do I start Spark Thrift Server with cloudera CDH 5.3? > > Thanks. >

Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-13 Thread Michael Armbrust
Here's a talk I gave on the topic: https://www.youtube.com/watch?v=i7l3JQRx7Qw http://www.slideshare.net/SparkSummit/structuring-spark-dataframes-datasets-and-streaming-by-michael-armbrust On Mon, Jun 13, 2016 at 4:01 AM, Arun Patel wrote: > In Spark 2.0, DataFrames

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-13 Thread swetha kasireddy
Hi Mich, Following is a sample code snippet: val userDF = userRecsDF.toDF("idPartitioner", "dtPartitioner", "userId", "userRecord").persist() System.out.println(" userRecsDF.partitions.size"+ userRecsDF.partitions.size) userDF.registerTempTable("userRecordsTemp") sqlContext.sql("SET

Is there a limit on the number of tasks in one job?

2016-06-13 Thread khaled.hammouda
I'm trying to use Spark SQL to load json data that is split across about 70k files across 24 directories in hdfs, using sqlContext.read.json("hdfs:///user/hadoop/data/*/*"). This doesn't seem to work for some reason; I get timeout errors like the following: --- 16/06/13 15:46:31 ERROR

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-13 Thread swetha kasireddy
Hi Bijay, If I am hitting this issue, https://issues.apache.org/jira/browse/HIVE-11940, what needs to be done? Is upgrading to a higher version of Hive the only solution? Thanks! On Mon, Jun 13, 2016 at 10:47 AM, swetha kasireddy < swethakasire...@gmail.com> wrote: > Hi, > > Following is a

Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-13 Thread Deepak Sharma
Hi Ajay, Looking at the Spark code, I can see you used a HiveContext. Can you try using a SQLContext instead of the HiveContext there? Thanks Deepak On Mon, Jun 13, 2016 at 10:15 PM, Ajay Chander wrote: > Hi Mohit, > > Thanks for your time. Please find my response below. > > Did

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-13 Thread swetha kasireddy
Hi, Following is a sample code snippet: val userDF = userRecsDF.toDF("idPartitioner", "dtPartitioner", "userId", "userRecord").persist() System.out.println(" userRecsDF.partitions.size"+ userRecsDF.partitions.size) userDF.registerTempTable("userRecordsTemp") sqlContext.sql("SET
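As an alternative to the SQL route sketched above, the DataFrame writer can create the partition directories directly (Spark 1.6); the partition columns come from the snippet above and the output path is illustrative.

userDF.write
  .partitionBy("idPartitioner", "dtPartitioner")   // one directory per value pair
  .mode("append")
  .format("orc")
  .save("/data/userRecords")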

Computing on each partition/executor with "persistent" data

2016-06-13 Thread Jeroen Miller
Dear fellow Sparkers, I am barely dipping my toes into the Spark world and I was wondering if the following workflow can be implemented in Spark: 1. Initialize a custom data structure DS on each executor. These data structures DS should live until the end of the program.
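A hedged sketch of one common pattern for step 1: a JVM-wide singleton that each executor initialises lazily the first time a task touches it, and that then lives for the lifetime of the executor process. The cache here is a stand-in for whatever DS really is, and rdd is assumed to be an RDD[String].

import scala.collection.mutable

object ExecutorState {
  // Built once per executor JVM on first use; not thread-safe as written,
  // so real code would use a concurrent structure or synchronise access.
  lazy val cache: mutable.Map[String, Int] = mutable.Map.empty[String, Int]
}

val annotated = rdd.mapPartitions { keys =>
  val cache = ExecutorState.cache          // same instance for every task in this executor
  keys.map(k => k -> cache.getOrElseUpdate(k, k.length))
}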

Re: Spark Streaming application failing with Kerberos issue while writing data to HBase

2016-06-13 Thread Ted Yu
Can you show a snippet of your code, please? Please refer to obtainTokenForHBase() in yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala Cheers On Mon, Jun 13, 2016 at 4:44 AM, Kamesh wrote: > Hi All, > We are building a spark streaming
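A heavily hedged sketch of one way people work around this (distinct from the obtainTokenForHBase route Ted points to): log in from a keytab inside each executor before opening the HBase connection. The principal, the keytab file (which would have to be shipped with --files), the table name, and the assumption that the stream already carries HBase Put objects are all illustrative.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.security.UserGroupInformation

stream.foreachRDD { rdd =>
  rdd.foreachPartition { puts =>
    val conf = HBaseConfiguration.create()
    UserGroupInformation.setConfiguration(conf)
    // Executor-side login so the HBase RPC has valid Kerberos credentials.
    UserGroupInformation.loginUserFromKeytab("spark-user@EXAMPLE.COM", "spark-user.keytab")
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf("events"))
    puts.foreach(p => table.put(p))
    table.close()
    connection.close()
  }
}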

Re: What is the interpretation of Cores in Spark doc

2016-06-13 Thread Mark Hamstra
I don't know what documentation you were referring to, but this is clearly an erroneous statement: "Threads are virtual cores." At best it is terminology abuse by a hardware manufacturer. Regardless, Spark can't get too concerned about how any particular hardware vendor wants to refer to the

SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-13 Thread Ajay Chander
Hi Mohit, Thanks for your time. Please find my response below. Did you try the same with another database? I do load data from MySQL and SQL Server the same way (through Spark SQL JDBC), which works perfectly fine. As a workaround you can write the select statement yourself instead of just
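A sketch of the "write the select statement yourself" workaround mentioned above: push a full query, rather than a bare table name, through the JDBC dbtable option so Spark does not generate the identifier quoting that may be tripping the SAS driver. The URL, driver class, library, and column names are placeholders for the actual SAS JDBC setup.

val sasDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:sas://sas-host:8591")          // placeholder URL
  .option("driver", "com.example.sas.JdbcDriver")     // placeholder driver class
  .option("dbtable", "(SELECT col1, col2 FROM somelib.sometable) t")
  .load()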

Re: Basic question. Access MongoDB data in Spark.

2016-06-13 Thread Prajwal Tuladhar
Maybe try opening an issue in their GitHub repo: https://github.com/Stratio/Spark-MongoDB On Mon, Jun 13, 2016 at 4:10 PM, Umair Janjua wrote: > Anybody knows the stratio's mailing list? I cant seem to find it. Cheers > > On Mon, Jun 13, 2016 at 6:02 PM, Ted Yu

Re: Basic question. Access MongoDB data in Spark.

2016-06-13 Thread Umair Janjua
Does anybody know Stratio's mailing list? I can't seem to find it. Cheers On Mon, Jun 13, 2016 at 6:02 PM, Ted Yu wrote: > Have you considered posting the question on stratio's mailing list ? > > You may get faster response there. > > > On Mon, Jun 13, 2016 at 8:09 AM, Umair

Re: Basic question. Access MongoDB data in Spark.

2016-06-13 Thread Ted Yu
Have you considered posting the question on Stratio's mailing list? You may get a faster response there. On Mon, Jun 13, 2016 at 8:09 AM, Umair Janjua wrote: > Hi guys, > > I have this super basic problem which I cannot figure out. Can somebody > give me a hint. > >

Re: Kafka Exceptions

2016-06-13 Thread Cody Koeninger
Is the exception on the driver or the executor? To be clear, you're going to see a task fail if a partition changes leader while the task is running, regardless of configuration settings. The task should be retried up to maxFailures, though. What are maxRetries and maxFailures set to? How

Re: Kafka Exceptions

2016-06-13 Thread Bryan Jeffrey
Cody, We already set the maxRetries. We're still seeing the issue - when the leader is shifted, for example, the direct stream reader does not appear to handle it correctly. We're running 1.6.1. Bryan Jeffrey On Mon, Jun 13, 2016 at 10:37 AM, Cody Koeninger wrote: >

Basic question. Access MongoDB data in Spark.

2016-06-13 Thread Umair Janjua
Hi guys, I have this super basic problem which I cannot figure out. Can somebody give me a hint. http://stackoverflow.com/questions/37793214/spark-mongdb-data-using-java Cheers

Re: Kafka Exceptions

2016-06-13 Thread Cody Koeninger
http://spark.apache.org/docs/latest/configuration.html spark.streaming.kafka.maxRetries spark.task.maxFailures On Mon, Jun 13, 2016 at 8:25 AM, Bryan Jeffrey wrote: > All, > > We're running a Spark job that is consuming data from a large Kafka cluster > using the
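For completeness, a sketch of where those two settings are usually applied; the values are illustrative, not recommendations.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-direct-job")
  .set("spark.streaming.kafka.maxRetries", "5")   // retries when looking up leader offsets
  .set("spark.task.maxFailures", "8")             // task attempts before the stage is failed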

Kafka Exceptions

2016-06-13 Thread Bryan Jeffrey
All, We're running a Spark job that is consuming data from a large Kafka cluster using the Direct Stream receiver. We're seeing intermittent NotLeaderForPartitionExceptions when the leader is moved to another broker. Currently even with retry enabled we're seeing the job fail at this exception.

Re: OutOfMemoryError - When saving Word2Vec

2016-06-13 Thread sharad82
Is this the right forum to post Spark-related issues? I have tried this forum along with StackOverflow but am not seeing any response. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemoryError-When-saving-Word2Vec-tp27142p27151.html Sent from the

Spark Streaming application failing with Kerberos issue while writing data to HBase

2016-06-13 Thread Kamesh
Hi All, We are building a Spark Streaming application that writes data to an HBase table, but writes/reads are failing with the following exception: 16/06/13 04:35:16 ERROR ipc.AbstractRpcClient: SASL authentication failed. The most likely cause is missing or invalid credentials.

Spark 2.0: Unify DataFrames and Datasets question

2016-06-13 Thread Arun Patel
In Spark 2.0, DataFrames and Datasets are unified. DataFrame is simply an alias for a Dataset of type Row. I have a few questions. 1) What does this really mean to an application developer? 2) Why was this unification needed in Spark 2.0? 3) What changes can be observed in Spark 2.0 vs Spark 1.6?
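A small sketch of what the unification looks like in practice in 2.0, where DataFrame is just a type alias for Dataset[Row]; spark is assumed to be a SparkSession (as in the 2.0 shell) and Person is an illustrative case class.

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

case class Person(name: String, age: Long)

val ds: Dataset[Person] = Seq(Person("a", 30), Person("b", 25)).toDS()
val df: DataFrame = ds.toDF()                     // typed Dataset -> untyped Dataset[Row]
val typedAgain: Dataset[Person] = df.as[Person]   // back to a typed view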

Re: Spark Streaming checkpoint and restoring files from S3

2016-06-13 Thread Natu Lauchande
Hi, It seems to me that the checkpoint command is not persisting the SparkContext hadoop configuration correctly. Can this be a possibility? Thanks, Natu On Mon, Jun 13, 2016 at 11:57 AM, Natu Lauchande wrote: > Hi, > > I am testing disaster recovery from Spark and

Spark Streaming checkpoint and restoring files from S3

2016-06-13 Thread Natu Lauchande
Hi, I am testing disaster recovery in Spark and having some issues when trying to restore an input file from S3: 2016-06-13 11:42:52,420 [main] INFO org.apache.spark.streaming.dstream.FileInputDStream$FileInputDStreamCheckpointData - Restoring files for time 146581086 ms -
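A hedged sketch of the recovery pattern, assuming the suspicion above is right and the Hadoop configuration is not coming back from the checkpoint: putting the S3 credentials on the SparkConf as spark.hadoop.* properties keeps them inside the configuration that the checkpoint does restore. Bucket names, paths, and the batch interval are illustrative.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "s3n://my-bucket/checkpoints"   // illustrative

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("s3-recovery-test")
    // spark.hadoop.* entries travel with the SparkConf, so they may survive a
    // restore where plain sparkContext.hadoopConfiguration.set(...) does not.
    .set("spark.hadoop.fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    .set("spark.hadoop.fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir)
  ssc.textFileStream("s3n://my-bucket/input/").count().print()
  ssc
}

// On a clean start createContext() runs; after a failure the context is rebuilt
// from the checkpoint and createContext() is skipped.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()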

Re: Java MongoDB Spark Stratio (Please give me a hint)

2016-06-13 Thread Umair Janjua
The dataframe does not get any data. What could I be doing wrong here? I have rechecked the credentials and other settings. I am still trying to debug the issue without any luck so far. On Mon, Jun 13, 2016 at 11:30 AM, Umair Janjua wrote: > Any idea what I might be doing

Re: Java MongoDB Spark Stratio (Please give me a hint)

2016-06-13 Thread Umair Janjua
Any idea what I might be doing wrong? I am new to Spark and I cannot proceed forward from here: --- JavaSparkContext sc = new JavaSparkContext("local[*]", "test spark-mongodb java"); SQLContext sqlContext = new

Re: Spark Getting data from MongoDB in JAVA

2016-06-13 Thread Asfandyar Ashraf Malik
Yes, it was a dependency issue. I was using incompatible versions. The command mvn dependency:tree -Dverbose helped me fix this. Cheers Asfandyar Ashraf Malik

cluster mode for Python on standalone cluster

2016-06-13 Thread Jan Šourek
The official documentation states 'Currently only YARN supports cluster mode for Python applications.' I would like to know if work is being done or planned to support cluster mode for Python applications on standalone spark clusters?

cluster mode for Python on standalone clusters

2016-06-13 Thread Jan Sourek
The official documentation states 'Currently only YARN supports cluster mode for Python applications.' I would like to know if work is being done or planned to support cluster mode for Python applications on standalone Spark clusters?

Re: Another problem about parallel computing

2016-06-13 Thread Terry Hoo
hero, Did you check whether there is any exception after retry? If the port is 0, the spark worker should bind to a random port. BTW, what's the spark version? Regards, - Terry On Mon, Jun 13, 2016 at 4:24 PM, hero wrote: > Hi, guys > > I have another problem about

Another problem about parallel computing

2016-06-13 Thread hero
Hi, guys I have another problem with Spark on YARN. Today, I was running start-all.sh when I found only two workers in the Web UI, even though in fact I have four nodes. Only two nodes are displayed: one is master, one is slave2. The '/etc/hosts' file is shown below: 127.0.0.1 localhost

Spark 2.0.0 : GLM problem

2016-06-13 Thread april_ZMQ
Hi all, I’ve tried the GLM (Generalized Linear Model) of Spark 2.0.0-preview and I’ve encountered some unexpected problems. • First problem: I tested the “poisson” family GLM with a very small dataset using SparkR 2.0.0. This dataset runs the “poisson” family GLM in standard R successfully.
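For comparison, a hedged Scala equivalent of the SparkR call being tested, using the Spark 2.0 ml API; trainingDF is assumed to already have a numeric label column and a Vector features column, and the settings are illustrative.

import org.apache.spark.ml.regression.GeneralizedLinearRegression

val glr = new GeneralizedLinearRegression()
  .setFamily("poisson")
  .setLink("log")
  .setMaxIter(25)
val model = glr.fit(trainingDF)
println(model.coefficients)   // fitted coefficients, analogous to R's glm output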

Re: Should I avoid "state" in an Spark application?

2016-06-13 Thread Alonso Isidoro Roman
Hi Haopu, please check these threads: http://stackoverflow.com/questions/24331815/spark-streaming-historical-state https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/total.html Alonso Isidoro Roman [image: https://]about.me/alonso.isidoro.roman

Hive 1.0.0 not able to read Spark 1.6.1 parquet output files on EMR 4.7.0

2016-06-13 Thread mayankshete
Hello Team, I am facing an issue where output files generated by Spark 1.6.1 are not read by Hive 1.0.0. This is because Hive 1.0.0 uses an older Parquet version than Spark 1.6.1, which uses Parquet 1.7.0. Is it possible to use an older Parquet version in Spark or a newer Parquet version in

Dataframe : Column features must be of type org.apache.spark.mllib.linalg.VectorUDT

2016-06-13 Thread Zakaria Hili
Hi, I created a dataframe using a schema, but when I try to create a model, I receive this error: requirement failed: Column features must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually ArrayType(StringType,true). Piece of code: SQLContext sqlContext =
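A sketch of the usual fix for this error: the estimator wants a Vector column, so convert the ArrayType(StringType) column with a UDF before fitting. df and the column names are illustrative, and this assumes Spark 1.x, where ml estimators still expect org.apache.spark.mllib.linalg vectors (hence the error message above).

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.udf

// Parse the string array into a dense numeric vector.
val toVector = udf((xs: Seq[String]) => Vectors.dense(xs.map(_.toDouble).toArray))
val prepared = df.withColumn("features", toVector(df("rawFeatures")))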

Re: StackOverflow in Spark

2016-06-13 Thread Terry Hoo
Maybe the same issue as SPARK-6847, which has been fixed in Spark 2.0. Regards - Terry On Mon, Jun 13, 2016 at 3:15 PM, Michel Hubert wrote: > > > I’ve found my problem. > > > > I’ve got a DAG with two consecutive

RE: StackOverflow in Spark

2016-06-13 Thread Michel Hubert
I’ve found my problem. I’ve got a DAG with two consecutive “updateStateByKey” functions. When I only process (map/foreachRDD/JavaEsSpark) the state of the last “updateStateByKey” function, I get a stack overflow after a while (too long lineage). But when I also do some processing
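A sketch of the usual fix for a lineage that keeps growing across updateStateByKey calls: set a checkpoint directory on the streaming context and, if needed, checkpoint the state stream at a longer interval so the lineage is truncated periodically. ssc, events, and the two update functions are assumed from the surrounding job; the directory is illustrative.

import org.apache.spark.streaming.Seconds

ssc.checkpoint("hdfs:///checkpoints/my-app")      // enables state checkpointing

val state1 = events.updateStateByKey(updateFunc1)
val state2 = state1.updateStateByKey(updateFunc2)
state2.checkpoint(Seconds(60))                    // truncate the lineage every minute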