multiple spark streaming contexts

2016-07-30 Thread Sumit Khanna
Hey, Was wondering if I could create multiple Spark streaming contexts in my application (e.g. instantiating a worker actor per topic, each with its own streaming context, its own batch duration, everything). What are the caveats, if any? What are the best practices? Have googled half-heartedly on the
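A caveat worth noting: as far as I know, only one StreamingContext can be active per JVM at a time, and it fixes a single batch duration, so the usual pattern is one context with one input DStream per topic. A minimal Scala sketch, with hypothetical socket sources standing in for the per-topic streams:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object MultiTopicSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("multi-topic-sketch")
        // One StreamingContext (and hence one batch duration) per JVM;
        // attach one input DStream per source instead of one context per topic.
        val ssc = new StreamingContext(conf, Seconds(10))

        // Hypothetical sources standing in for per-topic streams.
        val topicA = ssc.socketTextStream("hostA", 9999)
        val topicB = ssc.socketTextStream("hostB", 9998)

        topicA.foreachRDD(rdd => println(s"topic A batch size: ${rdd.count()}"))
        topicB.foreachRDD(rdd => println(s"topic B batch size: ${rdd.count()}"))

        ssc.start()
        ssc.awaitTermination()
      }
    }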

Re: How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)

2016-07-30 Thread Bhaarat Sharma
I am just trying to do this as a proof of concept. The actual content of the files will be quite big. I'm having a problem using foreach or something similar on an RDD. sc.binaryFiles("/root/sift_images_test/*.jpg") returns ("filename1", bytes) ("filename2", bytes). I'm wondering if there is a do

Re: How to filter based on a constant value

2016-07-30 Thread Xinh Huynh
Hi Mitch, I think you were missing a step: [your result] maxdate: org.apache.spark.sql.Row = [2015-12-15] Since maxdate is of type Row, you would want to extract the first column of the Row with: >> val maxdateStr = maxdate.getString(0) assuming the column type is String. API doc is here:
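A minimal sketch of that step in Scala, assuming a DataFrame named df with a string column transactiondate (names hypothetical beyond what the thread shows):

    import org.apache.spark.sql.functions.{col, max}

    // `df` is assumed to hold the transactions.
    val maxdate = df.agg(max("transactiondate")).head()   // a Row, e.g. [2015-12-15]
    val maxdateStr = maxdate.getString(0)                  // first column of the Row as a String
    val lastRows = df.filter(col("transactiondate") === maxdateStr)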

Re: How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)

2016-07-30 Thread ayan guha
This sounds like a bad idea, given HDFS does not work well with small files. On Sun, Jul 31, 2016 at 8:57 AM, Bhaarat Sharma wrote: > I am reading a bunch of files in PySpark using binaryFiles. Then I want to > get the number of bytes for each file and write this number to an HDFS

Re: How to filter based on a constant value

2016-07-30 Thread ayan guha
select * from (select *, rank() over (order by transactiondate) r from ll_18740868 where transactiondescription='XYZ' ) inner where r=1 Hi Mitch, If using SQL is fine, you can try the code above. You need to register ll_18740868 as temp table. On Sun, Jul 31, 2016 at
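A sketch of running that SQL from Scala, assuming a DataFrame df holds the transactions; the subquery alias is renamed from the thread's `inner`, and `desc` is added so rank 1 is the most recent row:

    // Register `df` (hypothetical name) under the table name used in the thread.
    df.registerTempTable("ll_18740868")          // createOrReplaceTempView in Spark 2.0

    val lastPayment = sqlContext.sql(
      """select * from
        |  (select *, rank() over (order by transactiondate desc) r
        |   from ll_18740868
        |   where transactiondescription = 'XYZ') ranked
        |where r = 1""".stripMargin)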

Re: PySpark 1.6.1: 'builtin_function_or_method' object has no attribute '__code__' in Pickles

2016-07-30 Thread ayan guha
Hi, Glad that your problem is resolved. spark-submit is the recommended way of submitting an application (PySpark internally does spark-submit). Yes, the process remains the same for a single node vs multiple nodes. However, I would suggest using one of the cluster modes instead of local mode. In single

spark-submit hangs forever after all tasks finish(spark 2.0.0 stable version on yarn)

2016-07-30 Thread taozhuo
Below are the error messages that seem to run infinitely: 16/07/30 23:25:38 DEBUG ProtobufRpcEngine: Call: getApplicationReport took 1ms 16/07/30 23:25:39 DEBUG Client: IPC Client (1735131305) connection to /10.80.1.168:8032 from zhuotao sending #147247 16/07/30 23:25:39 DEBUG Client: IPC Client

Re: Structured Streaming Parquet Sink

2016-07-30 Thread Arun Patel
Thanks for the response. However, I am not able to use any output mode. In the case of the parquet sink, should there not be any aggregations? scala> val query = streamingCountsDF.writeStream.format("parquet").option("path","parq").option("checkpointLocation","chkpnt").outputMode("complete").start()

How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)

2016-07-30 Thread Bhaarat Sharma
I am reading a bunch of files in PySpark using binaryFiles. Then I want to get the number of bytes for each file and write this number to an HDFS file with the corresponding name. Example: if directory /myimages has one.jpg, two.jpg, and three.jpg then I want three files one-success.jpg,
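The thread is PySpark, but a minimal Scala sketch of the idea might look like the following; the output directory is hypothetical, and note the small-files caveat raised in the replies:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // (path, content) pairs for every image under the directory.
    val files = sc.binaryFiles("/root/sift_images_test/*.jpg")

    // Write one tiny HDFS file per image containing its byte count
    // (output directory is hypothetical; many small files will strain HDFS).
    files.foreachPartition { iter =>
      val fs = FileSystem.get(new Configuration())
      iter.foreach { case (path, stream) =>
        val numBytes = stream.toArray().length
        val name = new Path(path).getName
        val out = fs.create(new Path(s"/user/spark/image_sizes/$name-success"))
        out.writeBytes(s"$numBytes\n")
        out.close()
      }
    }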

Dataframe and corresponding RDD return different rows (PySpark)

2016-07-30 Thread Params
Hi, I am facing a weird behavior where the dataframe and the downstream list and map generated from its RDD equivalent seem to be returning different rows. What could possibly be going wrong? Any help is appreciated. Below is a snippet of the code along with the output: NOTE: [1] samples is a

Re: Structured Streaming Parquet Sink

2016-07-30 Thread Tathagata Das
Correction, the two options are: - writeStream.format("parquet").option("path", "...").start() - writestream.parquet("...").start() There is no start() with a param. On Jul 30, 2016 11:22 AM, "Jacek Laskowski" wrote: > Hi Arun, > > > As per documentation, parquet is the only
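Putting this correction together with the 'path' error from the original post, the working shape in 2.0 is presumably something like the sketch below, assuming streamingCountsDF carries no aggregations (the file sink appears to accept only append mode):

    val query = streamingCountsDF.writeStream
      .format("parquet")
      .option("path", "parq")                    // where the parquet files go
      .option("checkpointLocation", "chkpnt")    // required for the file sink
      .outputMode("append")                      // the file sink rejects complete mode
      .start()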

Re: Visualization of data analysed using spark

2016-07-30 Thread Rerngvit Yanggratoke
Since you already have an existing application (not starting from scratch), the simplest way to visualize would be to export the data to a file (e.g., a CSV file) and visualise it using other tools, e.g., Excel, RStudio, Matlab, Jupyter, Zeppelin, Tableau, Elastic Stack. The choice depends on your
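A minimal sketch of the export step, assuming Spark 2.0's built-in CSV writer (on 1.x the spark-csv package would be needed); resultDF and the output path are hypothetical:

    // Write the analysed DataFrame out as CSV so an external tool (Excel, Zeppelin, ...) can plot it.
    resultDF
      .coalesce(1)                        // single CSV file; only sensible for small results
      .write
      .option("header", "true")
      .csv("/tmp/analysis_results_csv")   // hypothetical output directory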

How to filter based on a constant value

2016-07-30 Thread Mich Talebzadeh
Hi, I would like to find out the last time I paid a company with a debit card. This is the way I do it: 1) Find the date when I paid last 2) Find the rest of the details from the row(s) So var HASHTAG = "XYZ" scala> var maxdate =

Re: how to order data in descending order in spark dataset

2016-07-30 Thread Mark Wusinich
> ts.groupBy("b").count().orderBy(col("count"), ascending=False) Sent from my iPhone > On Jul 30, 2016, at 2:54 PM, Don Drake wrote: > > Try: > > ts.groupBy("b").count().orderBy(col("count").desc()); > > -Don > >> On Sat, Jul 30, 2016 at 1:30 PM, Tony Lane

Visualization of data analysed using spark

2016-07-30 Thread Tony Lane
I am developing my analysis application using Spark (with Eclipse as the IDE). What is a good way to visualize the data, taking into consideration I have multiple files which make up my Spark application? I have seen some notebook demos but am not sure how to use my application with such

Re: how to order data in descending order in spark dataset

2016-07-30 Thread Don Drake
Try: ts.groupBy("b").count().orderBy(col("count").desc()); -Don On Sat, Jul 30, 2016 at 1:30 PM, Tony Lane wrote: > just to clarify I am try to do this in java > > ts.groupBy("b").count().orderBy("count"); > > > > On Sun, Jul 31, 2016 at 12:00 AM, Tony Lane
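For completeness, the same answer with the import it needs, in Scala (ts is the Dataset from the thread); from Java the call is the same via a static import of org.apache.spark.sql.functions.col:

    import org.apache.spark.sql.functions.col

    // Descending order on the aggregated count.
    val ordered = ts.groupBy("b").count().orderBy(col("count").desc)
    ordered.show()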

Re: sql to spark scala rdd

2016-07-30 Thread sri hari kali charan Tummala
For knowledge, just wondering how to write it up in Scala or as a Spark RDD. Thanks Sri On Sat, Jul 30, 2016 at 11:24 AM, Jacek Laskowski wrote: > Why? > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark 2.0

Re: how to order data in descending order in spark dataset

2016-07-30 Thread Tony Lane
Just to clarify, I am trying to do this in Java: ts.groupBy("b").count().orderBy("count"); On Sun, Jul 31, 2016 at 12:00 AM, Tony Lane wrote: > ts.groupBy("b").count().orderBy("count"); > > how can I order this data in descending order of count > Any suggestions > > -Tony

how to order data in descending order in spark dataset

2016-07-30 Thread Tony Lane
ts.groupBy("b").count().orderBy("count"); how can I order this data in descending order of count Any suggestions -Tony

Re: sql to spark scala rdd

2016-07-30 Thread Jacek Laskowski
Why? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sat, Jul 30, 2016 at 4:42 AM, kali.tumm...@gmail.com wrote: > Hi All, > > I

Re: Structured Streaming Parquet Sink

2016-07-30 Thread Jacek Laskowski
Hi Arun, > As per documentation, parquet is the only available file sink. The following sinks are currently available in Spark: * ConsoleSink for console format. * FileStreamSink for parquet format. * ForeachSink used in foreach operator. * MemorySink for memory format. You can create your own
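As an illustration of one of the other sinks listed, a sketch of the memory sink, which accumulates the streaming result in an in-memory table (streamingCountsDF reused from the thread; table name hypothetical):

    // Memory sink: keeps the streaming aggregation in an in-memory table.
    val memQuery = streamingCountsDF.writeStream
      .format("memory")
      .queryName("counts")            // name of the in-memory table
      .outputMode("complete")
      .start()

    // Query it like any other table (spark is the 2.0 SparkSession):
    spark.sql("select * from counts").show()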

Spark SQL - Invalid method name: 'alter_table_with_cascade'

2016-07-30 Thread KhajaAsmath Mohammed
Hi, I am trying to connect and retrieve some records from an existing Hive table. The spark-submit script is not able to pull the data from Hive, resulting in the below exception. *16/07/30 00:00:40 WARN RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect.*

[SPARK-3586][streaming]Support nested directories in Spark

2016-07-30 Thread 接立骞
Hi all: https://github.com/apache/spark/pull/2765 I use Spark Streaming to monitor nested directories, and I found the discussion listed above; however, it has stalled. Does anyone know the reason, and how to monitor nested directories? Thanks in advance

Re: how to copy local files to hdfs quickly?

2016-07-30 Thread Andy Davidson
For lack of a better solution I am using 'AWS s3 copy' to copy my files locally and 'hadoop fs -put ./tmp/*' to transfer them. In general, put works much better with a smaller number of big files compared to a large number of small files. Your mileage may vary. Andy From: Andrew Davidson

Re: use big files and read from HDFS was: performance problem when reading lots of small files created by spark streaming.

2016-07-30 Thread Andy Davidson
Hi Pedro, These are much more accurate performance numbers. In total I have 5,671,287 rows. Each row was stored in JSON. The JSON is very complicated and can be up to 4k per row. I randomly picked 30 partitions. My "big" files are at most 64M. execution time | src | coalesce(num) | file size | num files: 34min

Re: Spark Thrift Server (Spark 2.0) show table has value with NULL in all fields

2016-07-30 Thread Mich Talebzadeh
OK, I am using Spark 1.6.1 and Hive 2, but that doesn't seem to be the issue. The errors look very similar. I guess someone from Hive can answer this issue! Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Spark Thrift Server (Spark 2.0) show table has value with NULL in all fields

2016-07-30 Thread Chanh Le
Hi Mich, something is different in your log: > On Jul 30, 2016, at 6:58 PM, Mich Talebzadeh > wrote: > > parquet-mr version 1.6.0 > org.apache.parquet.VersionParser$VersionParseException: Could not parse > created_by: parquet-mr version 1.6.0 using format: (.+)

Structured Streaming Parquet Sink

2016-07-30 Thread Arun Patel
I am trying out the Structured Streaming parquet sink. As per the documentation, parquet is the only available file sink. I am getting an error that 'path' is not specified. scala> val query = streamingCountsDF.writeStream.format("parquet").start() java.lang.IllegalArgumentException: 'path' is not

Re: Spark Thrift Server (Spark 2.0) show table has value with NULL in all fields

2016-07-30 Thread Mich Talebzadeh
Yes. Something is wrong: even when I query the table in Hive with correct data, it throws an error about corrupt stats before showing the result of 1 row. hive> select * from abc limit 1; Jul 30, 2016 12:52:14 PM WARNING: org.apache.parquet.CorruptStatistics: Ignoring statistics because created_by could

Re: Spark 2.0 blocker on windows - spark-warehouse path issue

2016-07-30 Thread Sean Owen
This is https://issues.apache.org/jira/browse/SPARK-15899 On Sat, Jul 30, 2016 at 2:27 AM, Tony Lane wrote: > Caused by: java.net.URISyntaxException: Relative path in absolute URI: > file:C:/ibm/spark-warehouse > > Anybody knows a solution to this? > > cheers > tony >
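Until the fix lands, the workaround commonly reported on that ticket is, to the best of my knowledge, to point spark.sql.warehouse.dir at a well-formed file: URI; a sketch with a hypothetical directory:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("windows-workaround")
      // A well-formed file: URI avoids the "Relative path in absolute URI" parse error on Windows.
      .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
      .getOrCreate()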

Re: Spark Thrift Server (Spark 2.0) show table has value with NULL in all fields

2016-07-30 Thread Chanh Le
I received this log during a recent debug. Is that related to PARQUET-251? But I checked: Spark is currently using parquet 1.8.1, which means it is already fixed. 16/07/30 18:32:11 INFO SparkExecuteStatementOperation: Running query 'select * from topic18' with 72649e37-3ef4-4acd-8d01-4a28e79a1f9a 16/07/30 18:32:11

Re: Spark Thrift Server (Spark 2.0) show table has value with NULL in all fields

2016-07-30 Thread Mich Talebzadeh
Actually, Hive SQL is a superset of Spark SQL. Data types may not be an issue. If I create the table after DataFrame creation, explicitly as a Hive parquet table through Spark, Hive sees it and you can see it in the Spark thrift server with data in it (basically you are using the Hive thrift server under
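A sketch of the pattern described here, writing the DataFrame explicitly as a Hive parquet table from Spark (database and table names hypothetical):

    // Write the DataFrame as a Hive-managed parquet table so both Hive and the
    // Spark thrift server can read it.
    df.write
      .format("parquet")
      .mode("overwrite")
      .saveAsTable("test.abc")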

Re: Spark Thrift Server (Spark 2.0) show table has value with NULL in all fields

2016-07-30 Thread Chanh Le
I agree with you. Maybe there is some change to a data type in Spark that Hive still does not support or is not compatible with, which is why it shows NULL. > On Jul 30, 2016, at 5:47 PM, Mich Talebzadeh > wrote: > > I think it is still a Hive problem because Spark thrift server is

Re: Spark Thrift Server (Spark 2.0) show table has value with NULL in all fields

2016-07-30 Thread Mich Talebzadeh
I think it is still a Hive problem because the Spark thrift server is basically a Hive thrift server. An acid test would be to log in to the Hive CLI or Hive thrift server (you are actually using the Hive thrift server on port 1 when using the Spark thrift server) and see whether you see the data. When you use

Re: Spark Thrift Server (Spark 2.0) show table has value with NULL in all fields

2016-07-30 Thread Chanh Le
Hi Mich, Thanks for the support. Here are some of my thoughts. > BTW can you log in to thrift server and do select * from limit 10 > > Do you see the rows? Yes, I can see the rows but all the field values are NULL. > Works OK for me You just tested the number of rows. In my case I checked and it shows 117

Spark 2.0 blocker on windows - spark-warehouse path issue

2016-07-30 Thread Tony Lane
Caused by: java.net.URISyntaxException: Relative path in absolute URI: file:C:/ibm/spark-warehouse Anybody knows a solution to this? cheers tony

RE: PySpark 1.6.1: 'builtin_function_or_method' object has no attribute '__code__' in Pickles

2016-07-30 Thread Joaquin Alzola
An example (adding a package to the spark submit): bin/spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.0 spark_v3.py From: Bhaarat Sharma [mailto:bhaara...@gmail.com] Sent: 30 July 2016 06:38 To: ayan guha Cc: user

Re: Spark Thrift Server (Spark 2.0) show table has value with NULL in all fields

2016-07-30 Thread Mich Talebzadeh
BTW can you log in to thrift server and do select * from limit 10 Do you see the rows? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Re: Spark Thrift Server (Spark 2.0) show table has value with NULL in all fields

2016-07-30 Thread Mich Talebzadeh
Works OK for me scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "false").load("hdfs://rhes564:9000/data/stg/accounts/ll/18740868") df: org.apache.spark.sql.DataFrame = [C0: string, C1: string, C2: string, C3: string, C4: string,