Re: Custom RDD: Report Size of Partition in Bytes to Spark

2016-07-03 Thread Takeshi Yamamuro
How about using `SparkListener`? You can collect IO statistics through TaskMetrics#inputMetrics yourself. // maropu On Mon, Jul 4, 2016 at 11:46 AM, Pedro Rodriguez wrote: > Hi All, > > I noticed on some Spark jobs it shows you input/output read size. I am >
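A minimal sketch of that suggestion, assuming the standard SparkListener API (on Spark 1.x, taskMetrics.inputMetrics is an Option rather than a plain field, so the access below would need unwrapping):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Sketch: sum bytes read across finished tasks via their input metrics.
    class InputSizeListener extends SparkListener {
      @volatile var totalBytesRead = 0L

      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val metrics = taskEnd.taskMetrics
        if (metrics != null) {
          totalBytesRead += metrics.inputMetrics.bytesRead
        }
      }
    }

    // Register it on the driver, e.g.:
    //   sc.addSparkListener(new InputSizeListener)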

Re: Why so many parquet file part when I store data in Alluxio or File?

2016-07-03 Thread Chanh Le
Hi Gene, Could you give some suggestions on that? > On Jul 1, 2016, at 5:31 PM, Ted Yu wrote: > > The comment from zhangxiongfei was from a year ago. > > Maybe something changed since then? > > On Fri, Jul 1, 2016 at 12:07 AM, Chanh Le
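Not an answer from this thread, but the number of part files written generally matches the number of partitions of the DataFrame at write time; a hedged sketch of the usual way to reduce it (paths and the partition count are placeholders):

    // Fewer partitions before the write means fewer part-* files in the output.
    val df = sqlContext.read.parquet("alluxio://host:19998/input")
    df.coalesce(8)
      .write
      .mode("overwrite")
      .parquet("alluxio://host:19998/output")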

Custom RDD: Report Size of Partition in Bytes to Spark

2016-07-03 Thread Pedro Rodriguez
Hi All, I noticed on some Spark jobs it shows you input/output read size. I am implementing a custom RDD which reads files and would like to report these metrics to Spark since they are available to me. I looked through the RDD source code and a couple different implementations and the best I

Re: Saving parquet table as uncompressed with write.mode("overwrite").

2016-07-03 Thread Mich Talebzadeh
Checked default is gzip Dr Mich Talebzadeh

Graphframe Error

2016-07-03 Thread Arun Patel
I started my pyspark shell with command (I am using spark 1.6). bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6 I have copied http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.1.0-spark1.6/graphframes-0.1.0-spark1.6.jar to the lib directory of Spark as well. I
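For reference, a minimal sketch of what should work once the package is actually picked up, assuming the graphframes Python API (the toy data is made up):

    # Started with: bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6
    from graphframes import GraphFrame

    vertices = sqlContext.createDataFrame(
        [("a", "Alice"), ("b", "Bob")], ["id", "name"])
    edges = sqlContext.createDataFrame(
        [("a", "b", "friend")], ["src", "dst", "relationship"])

    g = GraphFrame(vertices, edges)
    g.inDegrees.show()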

Re: Saving parquet table as uncompressed with write.mode("overwrite").

2016-07-03 Thread Mich Talebzadeh
thanks Ted that was it :) scala> val c = sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed") c: Unit = () scala> val s4 = s.write.mode("overwrite").parquet("/user/hduser/sales4") s4: Unit = () Before -rw-r--r-- 2 hduser supergroup 17487 2016-07-03 22:28
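The fix quoted above, restated as a readable sketch (Spark 1.6 SQLContext; the codec option accepts uncompressed, snappy, gzip and lzo):

    // Set the codec before the write, then overwrite the table uncompressed.
    sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")

    val s = sqlContext.read.parquet("oraclehadoop.sales2")
    s.write.mode("overwrite").parquet("/user/hduser/sales4")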

Re: Saving parquet table as uncompressed with write.mode("overwrite").

2016-07-03 Thread Ted Yu
Have you tried the following (note the extraneous dot in your config name) ? val c = sqlContext.setConf("spark.sql.parquet.compression.codec", "none") Also, parquet() has compression parameter which defaults to None FYI On Sun, Jul 3, 2016 at 2:42 PM, Mich Talebzadeh

Saving parquet table as uncompressed with write.mode("overwrite").

2016-07-03 Thread Mich Talebzadeh
Hi, I simply read a Parquet table scala> val s = sqlContext.read.parquet("oraclehadoop.sales2") s: org.apache.spark.sql.DataFrame = [prod_id: bigint, cust_id: bigint, time_id: timestamp, channel_id: bigint, promo_id: bigint, quantity_sold: decimal(10,0), amount_sold: decimal(10,0)] Now all I

JAR files into python3

2016-07-03 Thread Joaquin Alzola
HI List, I have the following script which will be used in Spark. #!/usr/bin/env python3 from pyspark_cassandra import CassandraSparkContext, Row from pyspark import SparkContext, SparkConf from pyspark.sql import SQLContext import os os.environ['CLASSPATH']="/mnt/spark/lib" conf =
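Setting CLASSPATH from inside the script generally does not put extra jars on Spark's classpath; a hedged sketch of the usual alternative, passing them through spark.jars (or --jars on spark-submit). The jar path below is a placeholder, not taken from the original script:

    #!/usr/bin/env python3
    from pyspark import SparkConf, SparkContext

    # spark.jars is shipped to the driver and executors when the context starts.
    conf = (SparkConf()
            .setAppName("cassandra-example")
            .set("spark.jars", "/mnt/spark/lib/pyspark-cassandra.jar"))
    sc = SparkContext(conf=conf)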

Re: Working of Streaming Kmeans

2016-07-03 Thread Biplob Biswas
Hi, Can anyone please explain this? Thanks & Regards Biplob Biswas On Sat, Jul 2, 2016 at 4:48 PM, Biplob Biswas wrote: > Hi, > > I wanted to ask a very basic question about the working of Streaming > Kmeans. > > Does the model update only when training (i.e.
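For reference, a minimal StreamingKMeans sketch from the spark.mllib API: the centers are updated only by trainOn, once per batch of the training stream, while predictOn only scores (sc is an existing SparkContext; the directories are placeholders):

    import org.apache.spark.mllib.clustering.StreamingKMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    val trainingData = ssc.textFileStream("/tmp/kmeans/train").map(Vectors.parse)
    val testData = ssc.textFileStream("/tmp/kmeans/test").map(Vectors.parse)

    val model = new StreamingKMeans()
      .setK(3)
      .setDecayFactor(1.0)      // 1.0 keeps full history; < 1.0 forgets older batches
      .setRandomCenters(2, 0.0)

    model.trainOn(trainingData)        // model update happens here, per batch
    model.predictOn(testData).print()  // scoring only, no update

    ssc.start()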

RE: AMQP extension for Apache Spark Streaming (messaging/IoT)

2016-07-03 Thread Darren Govoni
This is fantastic news. Original message From: Paolo Patierno Date: 7/3/16 4:41 AM (GMT-05:00) To: user@spark.apache.org Subject: AMQP extension for Apache Spark Streaming (messaging/IoT) Hi all, I'm

Re: 'numBins' property not honoured in BinaryClassificationMetrics class when spark.default.parallelism is not set to 1

2016-07-03 Thread Sean Owen
Why do you say it's not honored -- what do you observe? Looking at the code, it does not seem to depend on the RDD parallelism. Can you narrow this down to a shorter example? On Wed, Jun 22, 2016 at 5:39 AM, Sneha Shukla wrote: > Hi, > > I'm trying to use the
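A short sketch of how numBins is passed, consistent with the reading above that it is a constructor argument controlling down-sampling of the curves rather than anything tied to spark.default.parallelism (the data is made up):

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    // (score, label) pairs, purely illustrative.
    val scoreAndLabels = sc.parallelize(Seq(
      (0.9, 1.0), (0.8, 1.0), (0.4, 0.0), (0.2, 0.0), (0.1, 0.0)))

    // numBins caps how many points the ROC / PR curves are reduced to.
    val metrics = new BinaryClassificationMetrics(scoreAndLabels, numBins = 2)
    println(metrics.roc().collect().mkString(", "))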

Re: 'numBins' property not honoured in BinaryClassificationMetrics class when spark.default.parallelism is not set to 1

2016-07-03 Thread sneha29shukla
Hi, Any pointers? I'm not sure if this thread is reaching the right audience?

Re: Ideas to put a Spark ML model in production

2016-07-03 Thread Alexey Pechorin
From my personal experience - we're reading the metadata of the features column in the dataframe to extract mapping of the feature indices to the original feature name, and use this mapping to translate the model coefficients into a JSON string that maps the original feature names to their
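A hedged sketch of that metadata lookup, assuming a spark.ml pipeline where a VectorAssembler produced the features column (df and lrModel are placeholders for the assembled DataFrame and a fitted linear model):

    import org.apache.spark.ml.attribute.AttributeGroup

    // The assembled "features" column carries per-index attribute metadata,
    // which maps vector indices back to the original column names.
    val attrs = AttributeGroup.fromStructField(df.schema("features")).attributes.get

    val namedCoefficients = attrs.map { a =>
      a.name.getOrElse(s"f${a.index.get}") -> lrModel.coefficients(a.index.get)
    }.toMap
    // namedCoefficients can then be serialized to JSON for the serving side.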

AMQP extension for Apache Spark Streaming (messaging/IoT)

2016-07-03 Thread Paolo Patierno
Hi all, I'm working on an AMQP extension for Apache Spark Streaming, developing a reliable receiver for that. After MQTT support (I see it in the Apache Bahir repository), another messaging/IoT protocol could be very useful for the Apache Spark Streaming ecosystem. Out there a lot of