Re: Can't load a RandomForestClassificationModel in Spark job

2017-02-14 Thread Hollin Wilkins
Hey there, creating a new SparkContext on workers will not work; only the driver is allowed to own a SparkContext. Are you trying to distribute your model to workers so you can create a distributed scoring service? If so, it may be worth looking into taking your models outside of a SparkContext
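A minimal sketch of the driver-side alternative (the model path and featuresDF are assumptions): load the model once on the driver and score through the DataFrame API, which runs on the executors without ever creating a SparkContext there.

  import org.apache.spark.ml.classification.RandomForestClassificationModel

  // Load on the driver...
  val model = RandomForestClassificationModel.load("hdfs:///models/rf")
  // ...and let transform() distribute the scoring work to the executors.
  val scored = model.transform(featuresDF)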

Re: Spark's execution plan debugging

2017-02-14 Thread ??????????
Hi all, I have an issue about the text: sortMergeJoin[c1#41,c1#98]. What do 41 and 98 stand for, please? Thanks :) ---Original--- From: "Swapnil Shinde" Date: 2017/2/11 07:38:42 To: "Yong Zhang"; Cc:

Re: Spark Thrift Server - Skip header when load data from local file

2017-02-14 Thread ayan guha
I doubt you can do that. Create a staging table and then insert into the main table after filtering the header. On Wed, 15 Feb 2017 at 4:01 pm, kumar r wrote: > Hi, > > I want to load data from csv file to Spark table using Spark thrift > server. When i load, header(first line)
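A minimal Scala sketch of the staging-table idea (table, column, and path names are placeholders): load the raw CSV into a staging table, then insert into the main table while filtering out the header row.

  // Staging table receives the CSV as-is, header included.
  spark.sql("CREATE TABLE staging_tbl (id STRING, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
  spark.sql("LOAD DATA LOCAL INPATH '/tmp/data.csv' INTO TABLE staging_tbl")
  // Copy everything except the header row into the real table.
  spark.sql("INSERT INTO TABLE main_tbl SELECT * FROM staging_tbl WHERE id <> 'id'")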

Spark Thrift Server - Skip header when load data from local file

2017-02-14 Thread kumar r
Hi, I want to load data from a csv file to a Spark table using the Spark thrift server. When I load, the header (first line) should be ignored. I have used the tblproperties("skip.header.line.count"="1") option, but it's not working and the first line is also included. Below is the spark sql query I have tried. create

Re: streaming-kafka-0-8-integration (direct approach) and monitoring

2017-02-14 Thread Mohammad Kargar
I will. Thanks anyway. Mohammad On Feb 14, 2017 7:24 PM, "Cody Koeninger" wrote: > Not sure what to tell you at that point - maybe compare what is > present in ZK to a known working group. > > On Tue, Feb 14, 2017 at 9:06 PM, Mohammad Kargar > wrote: > >

Re: streaming-kafka-0-8-integration (direct approach) and monitoring

2017-02-14 Thread Cody Koeninger
Not sure what to tell you at that point - maybe compare what is present in ZK to a known working group. On Tue, Feb 14, 2017 at 9:06 PM, Mohammad Kargar wrote: > Yes offset nodes are in zk and I can get the values. > > On Feb 14, 2017 6:54 PM, "Cody Koeninger"

Re: My spark job runs faster in spark 1.6 and much slower in spark 2.0

2017-02-14 Thread arun kumar Natva
Hi John, the input file has 30 billion rows. The size of the input data is 72 GB, and the output is expected to have readings for each account & day combination for 50k sample accounts, which means the total output record count = 50k * 365. On Tue, Feb 14, 2017 at 6:29 PM, Jörn Franke

Re: streaming-kafka-0-8-integration (direct approach) and monitoring

2017-02-14 Thread Mohammad Kargar
Yes offset nodes are in zk and I can get the values. On Feb 14, 2017 6:54 PM, "Cody Koeninger" wrote: > Are there offset nodes in ZK? > > for a consumer group named mygroup, a topic named test, and partition > 0 you should be able to connect to ZK and do e.g. > > get

Re: streaming-kafka-0-8-integration (direct approach) and monitoring

2017-02-14 Thread Cody Koeninger
Are there offset nodes in ZK? For a consumer group named mygroup, a topic named test, and partition 0 you should be able to connect to ZK and do e.g. get /consumers/mygroup/offsets/test/0. If those don't exist, those are the ZK nodes you would need to make sure get created / updated from your
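A rough Scala sketch of where those offsets come from on the Spark side (it assumes the stream was created with KafkaUtils.createDirectStream; the actual write to ZK is left as a comment since the client used for it is up to you):

  import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

  stream.foreachRDD { rdd =>
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    offsetRanges.foreach { o =>
      // Persist o.untilOffset under /consumers/<group>/offsets/<topic>/<partition>
      // with your ZK client so monitoring tools can see the progress.
      println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
    }
  }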

Re: Write JavaDStream to Kafka (how?)

2017-02-14 Thread Cody Koeninger
It looks like you're creating a kafka producer on the driver, and attempting to write the string representation of stringIntegerJavaPairRDD. Instead, you probably want to be calling stringIntegerJavaPairRDD.foreachPartition, so that producing to kafka is being done on the executor. Read
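A Scala sketch of the pattern Cody describes (topic name and broker address are placeholders): the producer is created inside foreachPartition so it lives on the executor, not the driver.

  import java.util.Properties
  import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

  stream.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      val props = new Properties()
      props.put("bootstrap.servers", "broker1:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)   // one producer per partition
      partition.foreach { case (k, v) =>
        producer.send(new ProducerRecord("output-topic", k, v.toString))
      }
      producer.close()
    }
  }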

Re: streaming-kafka-0-8-integration (direct approach) and monitoring

2017-02-14 Thread Cody Koeninger
Can you explain what wasn't successful and/or show code? On Tue, Feb 14, 2017 at 6:03 PM, Mohammad Kargar wrote: > As explained here, direct approach of integration between spark streaming > and kafka does not update offsets in Zookeeper, hence Zookeeper-based Kafka >

Re: How to specify default value for StructField?

2017-02-14 Thread Yong Zhang
You may be looking for something like "spark.sql.parquet.mergeSchema" for ORC. Unfortunately, I don't think it is available, unless someone tells me I am wrong. You can create a JIRA to request this feature, but we all know that Parquet is the first-class citizen format. Yong
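For comparison, this is the Parquet-side option Yong refers to (the path is illustrative); as noted, there is no equivalent for ORC here.

  // Per-read schema merging across Parquet files with differing schemas...
  val df = spark.read.option("mergeSchema", "true").parquet("/data/parquet_files")
  // ...or enabled globally for the session.
  spark.conf.set("spark.sql.parquet.mergeSchema", "true")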

RE: Can't load a RandomForestClassificationModel in Spark job

2017-02-14 Thread Jianhong Xia
Is there any update on this problem? I encountered the same issue that was mentioned here. I have CrossValidatorModel.transform(df) running on workers, which requires DataFrame as an input. However, we only have Arrays on workers. When we deploy our model into cluster mode, we could not create

Re: streaming-kafka-0-8-integration (direct approach) and monitoring

2017-02-14 Thread bryan . jeffrey
Mohammad, We store our offsets in Cassandra, and use that for tracking. This solved a few issues for us, as it provides a good persistence mechanism even when you're reading from multiple clusters. Bryan Jeffrey On Tue, Feb 14, 2017 at 7:03 PM

Re: using spark-xml_2.10 to extract data from XML file

2017-02-14 Thread Hyukjin Kwon
Hi Carlo, There was a bug in lower versions when accessing nested values in the library. Otherwise, I suspect another issue about parsing malformed XML. Could you maybe open an issue at https://github.com/databricks/spark-xml/issues with your sample data? I will stick with it until it is

PySpark: use one column to index another (udf of two columns?)

2017-02-14 Thread apu
Let's say I have a Spark (PySpark) dataframe with the following schema: root |-- myarray: array (nullable = true) |    |-- element: string (containsNull = true) |-- myindices: array (nullable = true) |    |-- element: integer (containsNull = true) It looks like:
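A sketch of one way to do this indexing, written in Scala for consistency with the other examples here (the dataframe name df is an assumption): a UDF that receives both array columns and picks the elements of myarray at the positions given in myindices.

  import org.apache.spark.sql.functions.udf
  import spark.implicits._

  // For each row, select arr(i) for every i in idx.
  val pick = udf { (arr: Seq[String], idx: Seq[Int]) => idx.map(arr(_)) }
  val result = df.withColumn("picked", pick($"myarray", $"myindices"))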

streaming-kafka-0-8-integration (direct approach) and monitoring

2017-02-14 Thread Mohammad Kargar
As explained here, the direct approach of integration between Spark Streaming and Kafka does not update offsets in Zookeeper, hence Zookeeper-based Kafka monitoring tools will not show progress (details). We followed the

Re: My spark job runs faster in spark 1.6 and much slower in spark 2.0

2017-02-14 Thread Jörn Franke
Can you check in the UI which tasks took most of the time? Even the 45 min looks a little bit long given that most of the time you only work with 50k rows. > On 15 Feb 2017, at 00:03, Timur Shenkao wrote: > > Hello, > I'm not sure that's your reason but check this

Re: My spark job runs faster in spark 1.6 and much slower in spark 2.0

2017-02-14 Thread Timur Shenkao
Hello, I'm not sure that's your reason but check this discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-td20803.html On Tue, Feb 14, 2017 at 9:25 PM, anatva wrote: > Hi, > I am reading an ORC

Re: fault tolerant dataframe write with overwrite

2017-02-14 Thread Jörn Franke
If you use S3 you can first copy it into a temporary folder on HDFS. However, for the checkpointing I would use the Spark implementation. You can also load the file from S3 and checkpoint to HDFS. > On 14 Feb 2017, at 17:43, Mendelson, Assaf wrote: > > Thanks, I didn’t

My spark job runs faster in spark 1.6 and much slower in spark 2.0

2017-02-14 Thread anatva
Hi, I am reading an ORC file, and perform some joins, aggregations and finally generate a dense vector to perform analytics. The code runs in 45 minutes on spark 1.6 on a 4 node cluster. When the same code is migrated to run on spark 2.0 on the same cluster, it takes around 4-5 hours. It is

Re: using spark-xml_2.10 to extract data from XML file

2017-02-14 Thread Carlo . Allocca
More specifically, given the following XML data structure (this is the structure of the XML file): xocs:doc |-- xocs:item: struct (nullable = true) |    |-- bibrecord: struct (nullable = true) |    |    |-- head: struct (nullable = true) |    |    |    |-- abstracts: struct (nullable = true)

Re: using spark-xml_2.10 to extract data from XML file

2017-02-14 Thread Carlo . Allocca
Dear All, I would like to ask for your help with the following issue when using spark-xml_2.10. Given an XML file with the following structure: xocs:doc |-- xocs:item: struct (nullable = true) |    |-- bibrecord: struct (nullable = true) |    |    |-- head: struct (nullable = true) |    |    |
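A rough sketch of reading such a file with spark-xml (the rowTag choice and the nested select path are assumptions based on the schema above; adjust them to the actual document):

  // Assumes the spark-xml package is on the classpath, e.g. --packages com.databricks:spark-xml_2.10:0.4.1
  val df = spark.read
    .format("com.databricks.spark.xml")
    .option("rowTag", "xocs:item")        // treat each xocs:item element as one row
    .load("/path/to/file.xml")

  // Nested fields are then reachable with dot notation.
  df.select("bibrecord.head.abstracts").show(false)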

Re: Strange behavior with 'not' and filter pushdown

2017-02-14 Thread Everett Anderson
Wrapping this up -- fix is in 2.1.0 and has been backported to the 2.0.x branch, as well. On Mon, Feb 13, 2017 at 6:41 PM, Everett Anderson wrote: > Went ahead and opened > > https://issues.apache.org/jira/browse/SPARK-19586 > > though I'd generally expect to just close it as

RE: fault tolerant dataframe write with overwrite

2017-02-14 Thread Mendelson, Assaf
Thanks, I didn’t know the Hadoop API supports file systems other than HDFS and the local file system (when there is 1 node). My main goal is indeed checkpointing: every N iterations I save the data for future use. The problem is that if I use overwrite mode then it first deletes and then

Re: Parquet Gzipped Files

2017-02-14 Thread Benjamin Kim
Jörn, I agree with you, but the vendor is a little difficult to work with. For now, I will try to decompress it from S3 and save it plainly into HDFS. If someone already has this example, please let me know. Cheers, Ben > On Feb 13, 2017, at 9:50 AM, Jörn Franke wrote:
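A hedged sketch of the decompress-and-copy step Ben describes (bucket, key, and target paths are placeholders; it assumes the .gz wrapping was applied to the whole file, so stripping it leaves a plain Parquet file):

  import java.net.URI
  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.hadoop.io.IOUtils
  import org.apache.hadoop.io.compress.GzipCodec
  import org.apache.hadoop.util.ReflectionUtils

  val conf = sc.hadoopConfiguration
  val s3 = FileSystem.get(new URI("s3a://my-bucket/"), conf)
  val hdfs = FileSystem.get(conf)                       // the cluster's default FS
  val codec = ReflectionUtils.newInstance(classOf[GzipCodec], conf)

  // Stream from S3 through the gzip decompressor straight into HDFS.
  val in = codec.createInputStream(s3.open(new Path("s3a://my-bucket/data/file.parquet.gz")))
  val out = hdfs.create(new Path("/data/file.parquet"))
  IOUtils.copyBytes(in, out, conf, true)                // 'true' closes both streams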

RE: How to specify default value for StructField?

2017-02-14 Thread Begar, Veena
Thanks, it didn't work, because the folder has files from 2 different schemas. It fails with the following exception: org.apache.spark.sql.AnalysisException: cannot resolve '`f2`' given input columns: [f1]; -Original Message- From: smartzjp [mailto:zjp_j...@163.com] Sent: Tuesday,

Re: How to specify default value for StructField?

2017-02-14 Thread smartzjp
You can try the code below. val df = spark.read.format("orc").load("/user/hos/orc_files_test_together") df.select("f1","f2").show On 2017/2/14

Reusing HBase connection in transformations

2017-02-14 Thread DandyDev
Hi! I'm struggling with the following problem: I have a couple of Spark Streaming jobs that keep state (using mapWithState, and in one case updateStateByKey) and write their results to HBase. One of the Streaming jobs needs the results that the other Streaming job writes to HBase. How it's
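A rough Scala sketch of the usual connection-per-partition pattern for writing state to HBase (the stream, table, and column names are placeholders), which avoids trying to serialize an HBase connection into the closures:

  import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
  import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
  import org.apache.hadoop.hbase.util.Bytes

  stateStream.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      // One connection per partition, opened on the executor.
      val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val table = conn.getTable(TableName.valueOf("results"))
      partition.foreach { case (key: String, value) =>
        val put = new Put(Bytes.toBytes(key))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("v"), Bytes.toBytes(value.toString))
        table.put(put)
      }
      table.close()
      conn.close()
    }
  }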

Re: Lost executor 4 Container killed by YARN for exceeding memory limits.

2017-02-14 Thread Thakrar, Jayesh
Nancy, As the message from Spark indicates, spark.shuffle.memoryFraction is no longer used. It’s a unified heap space for both data caching and other things. Also, the previous 11 GB was not sufficient, and you are making the executor memory even smaller; I'm not sure how that will work. From: nancy

Re: Dealing with missing columns in SPARK SQL in JSON

2017-02-14 Thread Sam Elamin
Ah, if that's the case then you might need to define the schema beforehand. Either that, or if you want to infer it, ensure a JSON file exists with the right schema so Spark infers the right columns, essentially making both files one dataframe, if that makes sense. On Tue, Feb 14, 2017 at 3:04 PM,
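A minimal sketch of defining the schema up front so both files expose the same columns (field names follow the example in this thread; missing fields simply come back as null):

  import org.apache.spark.sql.types._

  val schema = StructType(Seq(
    StructField("a", IntegerType, nullable = true),
    StructField("b", IntegerType, nullable = true)))

  val df = spark.read.schema(schema).json("json1.json", "json2.json")
  df.createOrReplaceTempView("j")
  spark.sql("select a, b from j").show()   // one row with a=1, b=null; one with a=null, b=2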

Re: Dealing with missing columns in SPARK SQL in JSON

2017-02-14 Thread Aseem Bansal
Sorry if I trivialized the example. It is the same kind of file and sometimes it could have "a", sometimes "b", sometimes both. I just don't know. That is what I meant by missing columns. It would be good if I could read any of the JSON files, run Spark SQL, and have it give me for json1.json a | b 1 |

Re: Handling Skewness and Heterogeneity

2017-02-14 Thread Galen Marchetti
Anis, If your random partitions are smaller than your smallest machine, and you request executors for your spark jobs no larger than your smallest machine, then spark/cluster manager will automatically assign many executors to your larger machines. As long as you request small executors, you

Re: Dealing with missing columns in SPARK SQL in JSON

2017-02-14 Thread Sam Elamin
I may be missing something super obvious here, but can't you combine them into a single dataframe? Left join perhaps? Try writing it in SQL, "select a from json1 and b from json2", then run explain to give you a hint as to how to do it in code. Regards Sam On Tue, 14 Feb 2017 at 14:30, Aseem Bansal

Dealing with missing columns in SPARK SQL in JSON

2017-02-14 Thread Aseem Bansal
Say I have two files containing single rows: json1.json {"a": 1} json2.json {"b": 2} I read these JSON files using Spark's API into dataframes one at a time. So I have Dataset json1DF and Dataset json2DF. If I run "select a, b from __THIS__" in a SQLTransformer then I will get an exception

HiveContext on Spark 1.6 Linkage Error:ClassCastException

2017-02-14 Thread Enrico DUrso
Hello guys, hope all of you are ok. I am trying to use HiveContext on Spark 1.6. I am developing using Eclipse and I placed the hive-site.xml on the classpath; by doing so I use the Hive instance running on my cluster instead of creating a local metastore and a local warehouse. So far so good, in

NoSuchMethodException: org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions writing to Hive

2017-02-14 Thread nimrodo
Hi, I'm trying to write a DataFrame to a Hive partitioned table. This works fine from spark-shell, however when I use spark-submit i get the following exception: Exception in thread "main" java.lang.NoSuchMethodException:

Re: fault tolerant dataframe write with overwrite

2017-02-14 Thread Steve Loughran
On 14 Feb 2017, at 11:12, Mendelson, Assaf wrote: I know how to get the filesystem, the problem is that this means using Hadoop directly, so if in the future we change to something else (e.g. S3) I would need to rewrite the code. well,
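A short sketch of Steve's point that the Hadoop FileSystem API is already storage-agnostic: the same call returns an HDFS, S3A, or local client depending only on the URI scheme (the bucket and paths are illustrative).

  import java.net.URI
  import org.apache.hadoop.fs.{FileSystem, Path}

  val conf = spark.sparkContext.hadoopConfiguration
  // Swap the scheme (hdfs://, s3a://, file://) and the rest of the code stays the same.
  val fs = FileSystem.get(new URI("s3a://my-bucket/output"), conf)
  val done = fs.exists(new Path("s3a://my-bucket/output/_SUCCESS"))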

Re: Handling Skewness and Heterogeneity

2017-02-14 Thread Anis Nasir
Thank you very much for your reply. I guess this approach balances the load across the cluster of machines. However, I am looking for something for a heterogeneous cluster for which the distribution is not known a priori. Cheers, Anis On Tue, 14 Feb 2017 at 20:19, Galen Marchetti

Re: how to fix the order of data

2017-02-14 Thread 萝卜丝炒饭
It works well now, thanks. ---Original--- From: "Sam Elamin" Date: 2017/2/14 19:54:36 To: "萝卜丝炒饭"<1427357...@qq.com>; Cc: "user"; Subject: Re: how to fix the order of data Its because you are just printing on the rdd You can sort the df like

Re: how to fix the order of data

2017-02-14 Thread Sam Elamin
It's because you are just printing on the RDD. You can sort the df like below: input.toDF().sort().collect() or, if you do not want to convert to a dataframe, you can use sortByKey([ascending], [numTasks]). Regards Sam On Tue, Feb 14, 2017 at 11:41 AM, 萝卜丝炒饭
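A small sketch of why the order varies and how to make it deterministic: foreach(print) runs on the executors, so the output order depends on which partition finishes first, while collecting to the driver (optionally after an explicit sort) keeps a stable order.

  val input = sc.parallelize(Array(1, 2, 3))
  input.collect().foreach(print)                    // 123 - partition order, printed on the driver
  input.sortBy(identity).collect().foreach(print)   // 123 - explicitly sorted first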

how to fix the order of data

2017-02-14 Thread 萝卜丝炒饭
Hi all, below is my test code. I found the output of val input is different. How do I fix the order, please? scala> val input = sc.parallelize( Array(1,2,3)) input: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[13] at parallelize at <console>:24 scala> input.foreach(print) 132 scala>

Re: Handling Skewness and Heterogeneity

2017-02-14 Thread Galen Marchetti
Anis, I've typically seen people handle skew by seeding the keys corresponding to high volumes with random values, then partitioning the dataset based on the original key *and* the random value, then reducing. Ex: (<key>, <value>) -> (<key>, <salt>, <value>). This transformation reduces the size of the huge partition,
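A sketch of that salting idea in Scala (the pair RDD `pairs` and the salt range of 10 are assumptions): salt the key, reduce on (key, salt), then reduce again on the original key to merge the partial results.

  import scala.util.Random

  val salted  = pairs.map { case (k, v) => ((k, Random.nextInt(10)), v) }      // spread hot keys
  val partial = salted.reduceByKey(_ + _)                                      // first, smaller reduce
  val result  = partial.map { case ((k, _), v) => (k, v) }.reduceByKey(_ + _)  // combine the partials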

RE: fault tolerant dataframe write with overwrite

2017-02-14 Thread Mendelson, Assaf
I know how to get the filesystem; the problem is that this means using Hadoop directly, so if in the future we change to something else (e.g. S3) I would need to rewrite the code. This also relates to finding the last iteration: I would need to use the Hadoop filesystem, which is not agnostic to the

Re: fault tolerant dataframe write with overwrite

2017-02-14 Thread Jörn Franke
Normally you can fetch the filesystem interface from the configuration (I assume you mean URI). Managing to get the last iteration: I do not understand the issue. You can have the current timestamp as the directory and at the end you simply select the directory with the highest number.
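One way to realise that suggestion (the base path is illustrative): write each iteration to its own timestamped directory, and on restart pick the directory with the highest name instead of overwriting in place.

  import org.apache.hadoop.fs.{FileSystem, Path}

  val base = "hdfs:///data/results"
  // Each iteration lands in a fresh directory, so a failed write never destroys the previous one.
  df.write.parquet(s"$base/${System.currentTimeMillis()}")

  // Finding the latest completed iteration afterwards:
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val latest = fs.listStatus(new Path(base)).map(_.getPath.getName).max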

Different Results When Performing PCA with Spark and R

2017-02-14 Thread Amlan Jyoti
Dear all, I was exploring a use case of PCA and found that the results of Spark ML and R are different. More precisely, 1) eigenMatrix_Spark EQUALS-TO eigenMatrix_R 2) transformedData_Spark NOT-EQUALS-TO transformedData_R Sample Spark Code --

Re: [Spark Launcher] How to launch parallel jobs?

2017-02-14 Thread Cosmin Posteuca
Hi, Egor is right: for every partition it creates a task, and every task runs on a single core. But with different configurations Spark has different results: 1 executor with 4 cores takes 120 seconds; 2 executors with 2 cores each took 60 seconds twice and 120 seconds once; 4 executors with 1

fault tolerant dataframe write with overwrite

2017-02-14 Thread Mendelson, Assaf
Hi, I have a case where I have an iterative process which overwrites the results of a previous iteration. Every iteration I need to write a dataframe with the results. The problem is that when I write, if I simply overwrite the results of the previous iteration, this is not fault tolerant. i.e.

Re: wholeTextfiles not parallel, runs out of memory

2017-02-14 Thread Jörn Franke
Well, 1) the goal of wholeTextFiles is to have only one executor; 2) you use .gz, i.e. you will have only one executor per file maximum. > On 14 Feb 2017, at 09:36, Henry Tremblay wrote: > > When I use wholeTextFiles, spark does not run in parallel, and yarn runs out > of
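If wholeTextFiles over .gz inputs is unavoidable, one common mitigation (the path and partition count are placeholders) is to repartition right after reading so the per-file records get spread across the executors for the downstream work:

  // One record per file; gzip files are not splittable, so the read itself stays narrow.
  val files = sc.wholeTextFiles("hdfs:///input/*.gz", minPartitions = 100)
  // Spread the records out before the expensive processing.
  val spread = files.repartition(100)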

Re: [Spark Launcher] How to launch parallel jobs?

2017-02-14 Thread Cosmin Posteuca
Memory seems to be enough. My cluster has 22.5 GB total memory and my job uses 6.88 GB. If I run this job twice, they will use 13.75 GB, but sometimes the cluster has a memory spike of 19.5 GB. Thanks, Cosmin 2017-02-14 10:03 GMT+02:00 Mendelson, Assaf : > You should

RE: [Spark Launcher] How to launch parallel jobs?

2017-02-14 Thread Mendelson, Assaf
You should also check your memory usage. Let’s say, for example, you have 16 cores and 8 GB, and that you use 4 executors with 1 core each. When you use an executor, Spark reserves it from YARN, and YARN allocates the number of cores (e.g. 1 in our case) and the memory. The memory is actually more
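A sketch of the knobs involved (the values are placeholders only): the container YARN allocates per executor is the executor memory plus an overhead, so the overhead has to fit into the budget too.

  val conf = new org.apache.spark.SparkConf()
    .set("spark.executor.instances", "4")
    .set("spark.executor.cores", "1")
    .set("spark.executor.memory", "2g")
    .set("spark.yarn.executor.memoryOverhead", "512")  // MB added on top of executor memory per container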

Re: Is it better to Use Java or Python on Scala for Spark for using big data sets

2017-02-14 Thread Gourav Sengupta
Spark is written in Scala and that is why it's so efficient and super fast. If you are looking for efficiency, speed, and easily maintainable code (where you do not have to write thousands of lines of code instead of a few lines), please use Scala functional programming. I have seen horrible Java

Handling Skewness and Heterogeneity

2017-02-14 Thread Anis Nasir
Dear All, I have a few use cases for Spark Streaming where the Spark cluster consists of heterogeneous machines. Additionally, there is skew present in both the input distribution (e.g., each tuple is drawn from a Zipf distribution) and the service time (e.g., the service time required for each tuple comes