RE: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-12 Thread Mohammed Guller
Regarding the Spark scheduler – if you are referring to the ability to distribute 
workload and scale, Kafka Streaming also provides that capability. It is quite simple 
in that regard if you already have a Kafka cluster. You can launch multiple instances 
of your Kafka Streaming application, and Kafka Streaming will automatically balance the 
workload across the different instances. It rebalances the workload as you add or 
remove instances. Similarly, if an instance fails or crashes, it will automatically 
detect that and rebalance.

Regarding real-time – rather than debating which one is real-time, I would look at the 
latency requirements of my application. For most applications, the near-real-time 
capabilities of Spark Streaming might be good enough; for others, they may not be. For 
example, if I were building a high-frequency trading application, where I want to 
process individual trades as soon as they happen, I might lean towards Kafka Streaming.

Agree about the benefits of using SQL with structured streaming.

Mohammed

From: kant kodali [mailto:kanth...@gmail.com]
Sent: Sunday, June 11, 2017 3:41 PM
To: Mohammed Guller <moham...@glassbeam.com>
Cc: vincent gromakowski <vincent.gromakow...@gmail.com>; yohann jardin 
<yohannjar...@hotmail.com>; vaquar khan <vaquar.k...@gmail.com>; user 
<user@spark.apache.org>
Subject: Re: What is the real difference between Kafka streaming and Spark 
Streaming?

Another difference I see is Spark SQL, where there are logical plans, physical plans, 
code generation, and all those optimizations; I don't see them in Kafka Streaming at 
this time.
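
For reference, those plans can be inspected directly from the shell; a minimal sketch, 
assuming a Spark 2.x session named spark:

val df = spark.range(0, 1000).selectExpr("id % 10 as key", "id as value")
// explain(true) prints the parsed, analyzed, and optimized logical plans
// plus the generated physical plan for this aggregation.
df.groupBy("key").count().explain(true)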

On Sun, Jun 11, 2017 at 2:19 PM, kant kodali 
<kanth...@gmail.com<mailto:kanth...@gmail.com>> wrote:
I appreciate the responses; however, I see the other side of the argument, and I 
actually feel they are now competitors in the streaming space in some sense.

Kafka Streaming can indeed do map, reduce, join and window operations. Likewise, data 
can be ingested into Kafka from many sources and the results sent out to many sinks – 
look up "Kafka Connect".

Regarding event-at-a-time vs. micro-batch: I hear arguments from one group of people 
saying Spark Streaming is real time, and from another group saying Kafka Streaming is 
the true real time. So do we say micro-batch is real time, or event-at-a-time is real 
time?

It is a well-known fact that Spark is more popular with data scientists who want to run 
ML algorithms and so on, but I also hear that people can use the H2O package along with 
Kafka Streaming. How efficient each of these approaches is, I have no clue.

The major difference I see is actually the Spark scheduler. I don't think Kafka 
Streaming has anything like this; instead, it just allows you to run lambda expressions 
on a stream and write the output to a specific topic/partition, and from there one can 
use Kafka Connect to write it out to any sink. In short, all the optimizations built 
into the Spark scheduler don't seem to exist in Kafka Streaming, so if I were to make a 
decision on which framework to use, this is an additional question I would think about: 
"Do I want my stream to go through the scheduler, and if so, why or why not?"

Above all, please correct me if I am wrong :)




On Sun, Jun 11, 2017 at 12:41 PM, Mohammed Guller 
<moham...@glassbeam.com<mailto:moham...@glassbeam.com>> wrote:
Just to elaborate more on what Vincent wrote – Kafka Streaming provides true 
record-at-a-time processing capabilities, whereas Spark Streaming provides 
micro-batching capabilities on top of Spark. Depending on your use case, you may find 
one better than the other. Both provide stateless and stateful stream processing 
capabilities.

A few more things to consider:

  1.  If you don’t already have a Spark cluster, but have a Kafka cluster, it may be 
easier to use Kafka Streaming since you don’t need to set up and manage another cluster.
  2.  On the other hand, if you already have a Spark cluster, but don’t have a Kafka 
cluster (in case you are using some other messaging system), Spark Streaming is a 
better option.
  3.  If you already know and use Spark, you may find it easier to program with the 
Spark Streaming API even if you are using Kafka.
  4.  Spark Streaming may give you better throughput, so you have to decide what is 
more important for your stream processing application – latency or throughput.
  5.  Kafka Streaming is relatively new and less mature than Spark Streaming.

Mohammed


RE: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread Mohammed Guller
Just to elaborate more on what Vincent wrote – Kafka Streaming provides true 
record-at-a-time processing capabilities, whereas Spark Streaming provides 
micro-batching capabilities on top of Spark. Depending on your use case, you may find 
one better than the other. Both provide stateless and stateful stream processing 
capabilities.

A few more things to consider:

  1.  If you don’t already have a Spark cluster, but have a Kafka cluster, it may be 
easier to use Kafka Streaming since you don’t need to set up and manage another cluster.
  2.  On the other hand, if you already have a Spark cluster, but don’t have a Kafka 
cluster (in case you are using some other messaging system), Spark Streaming is a 
better option.
  3.  If you already know and use Spark, you may find it easier to program with the 
Spark Streaming API even if you are using Kafka.
  4.  Spark Streaming may give you better throughput, so you have to decide what is 
more important for your stream processing application – latency or throughput.
  5.  Kafka Streaming is relatively new and less mature than Spark Streaming.

Mohammed

From: vincent gromakowski [mailto:vincent.gromakow...@gmail.com]
Sent: Sunday, June 11, 2017 12:09 PM
To: yohann jardin 
Cc: kant kodali ; vaquar khan ; user 

Subject: Re: What is the real difference between Kafka streaming and Spark 
Streaming?

I think Kafka Streams is good when the processing of each row is independent of the 
others (row parsing, data cleaning...).
Spark is better when processing groups of rows (group by, ML, window functions...).

On 11 June 2017 at 8:15 PM, "yohann jardin" > wrote:

Hey,
Kafka can also do streaming on its own: 
https://kafka.apache.org/documentation/streams
I don’t know much about it, unfortunately. I can only repeat what I heard at 
conferences: that one should give Kafka Streaming a try when the whole pipeline is 
using Kafka. I have no pros/cons to argue on this topic.

Yohann Jardin
On 6/11/2017 at 7:08 PM, vaquar khan wrote:

Hi Kant,

Kafka is a message broker that is used with producers and consumers, while Spark 
Streaming is used for real-time processing; Kafka and Spark Streaming work together, 
they are not competitors.
Spark Streaming reads data from Kafka and processes it in micro-batches for streaming 
data: in simple terms, it collects data for some time, builds an RDD, and then 
processes these micro-batches.


Please read doc : 
https://spark.apache.org/docs/latest/streaming-programming-guide.html


Spark Streaming is an extension of the core Spark API that enables scalable, 
high-throughput, fault-tolerant stream processing of live data streams. Data 
can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, 
and can be processed using complex algorithms expressed with high-level 
functions like map, reduce, join and window. Finally, processed data can be 
pushed out to filesystems, databases, and live dashboards. In fact, you can 
apply Spark’s machine learning and graph processing algorithms on data streams.
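
A minimal sketch of that pattern on the Spark side, using the Kafka direct stream 
integration (the spark-streaming-kafka-0-10 package, broker address, topic name, and 
group id below are assumptions, not a tested setup):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("kafka-microbatch-sketch")
val ssc = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",          // assumed broker address
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",
  "auto.offset.reset" -> "latest"
)

// Each batch interval yields one RDD of Kafka records for the subscribed topic.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

// Word-count the message values with map/reduceByKey and print each batch.
val counts = stream.map(record => record.value).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()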


Regards,

Vaquar khan

On Sun, Jun 11, 2017 at 3:12 AM, kant kodali 
> wrote:
Hi All,

I am trying hard to figure out the real difference between Kafka Streaming and Spark 
Streaming, other than saying that one can be used as part of microservices (since Kafka 
Streaming is just a library) while the other is a standalone framework by itself.

If I can accomplish the same job either way, this is a sort of puzzling question for 
me, so it would be great to know what Spark Streaming can do that Kafka Streaming 
cannot do efficiently.

Thanks!




--
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago




RE: Explanation regarding Spark Streaming

2016-08-06 Thread Mohammed Guller
According to the docs for Spark Streaming, the default for data received 
through receivers is MEMORY_AND_DISK_SER_2. If windowing operations are 
performed, RDDs are persisted with StorageLevel.MEMORY_ONLY_SER.

http://spark.apache.org/docs/latest/streaming-programming-guide.html#data-serialization
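
A minimal sketch of where those storage levels show up in code (the host, port, and 
intervals are placeholders, and sc is an existing SparkContext, e.g. in spark-shell):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

// Receiver input: the storage level can be overridden per input stream
// (MEMORY_AND_DISK_SER_2 is the documented default for receivers).
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)

// Windowed DStreams are persisted automatically; the level can also be set explicitly.
val windowed = lines.window(Seconds(60), Seconds(10))
windowed.persist(StorageLevel.MEMORY_ONLY_SER)
windowed.count().print()

ssc.start()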

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Saturday, August 6, 2016 12:25 PM
To: Mohammed Guller
Cc: Jacek Laskowski; Saurav Sinha; user
Subject: Re: Explanation regarding Spark Streaming

Hi,

I think the default storage level 
<http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence> is 
MEMORY_ONLY

HTH




Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 6 August 2016 at 18:16, Mohammed Guller 
<moham...@glassbeam.com<mailto:moham...@glassbeam.com>> wrote:
Hi Jacek,
Yes, I am assuming that data streams in consistently at the same rate (for 
example, 100MB/s).

BTW, even if the persistence level for streaming data is set to 
MEMORY_AND_DISK_SER_2 (the default), once Spark runs out of memory, data will 
spill to  disk. That will make the application performance even worse.

Mohammed

From: Jacek Laskowski [mailto:ja...@japila.pl<mailto:ja...@japila.pl>]
Sent: Saturday, August 6, 2016 1:54 AM
To: Mohammed Guller
Cc: Saurav Sinha; user
Subject: RE: Explanation regarding Spark Streaming


Hi,

Thanks for explanation, but it does not prove Spark will OOM at some point. You 
assume enough data to store but there could be none.

Jacek

On 6 Aug 2016 4:23 a.m., "Mohammed Guller" 
<moham...@glassbeam.com<mailto:moham...@glassbeam.com>> wrote:
Assume the batch interval is 10 seconds and batch processing time is 30 
seconds. So while Spark Streaming is processing the first batch, the receiver 
will have a backlog of 20 seconds worth of data. By the time Spark Streaming 
finishes batch #2, the receiver will have 40 seconds worth of data in memory 
buffer. This backlog will keep growing as time passes assuming data streams in 
consistently at the same rate.

Also keep in mind that windowing operations on a DStream implicitly persist 
every RDD in a DStream in memory.

Mohammed

-Original Message-
From: Jacek Laskowski [mailto:ja...@japila.pl<mailto:ja...@japila.pl>]
Sent: Thursday, August 4, 2016 4:25 PM
To: Mohammed Guller
Cc: Saurav Sinha; user
Subject: Re: Explanation regarding Spark Streaming

On Fri, Aug 5, 2016 at 12:48 AM, Mohammed Guller 
<moham...@glassbeam.com<mailto:moham...@glassbeam.com>> wrote:
> and eventually you will run out of memory.

Why? Mind elaborating?

Jacek



RE: Explanation regarding Spark Streaming

2016-08-06 Thread Mohammed Guller
Hi Jacek,
Yes, I am assuming that data streams in consistently at the same rate (for 
example, 100MB/s).

BTW, even if the persistence level for streaming data is set to 
MEMORY_AND_DISK_SER_2 (the default), once Spark runs out of memory, data will 
spill to  disk. That will make the application performance even worse.

Mohammed

From: Jacek Laskowski [mailto:ja...@japila.pl]
Sent: Saturday, August 6, 2016 1:54 AM
To: Mohammed Guller
Cc: Saurav Sinha; user
Subject: RE: Explanation regarding Spark Streaming


Hi,

Thanks for explanation, but it does not prove Spark will OOM at some point. You 
assume enough data to store but there could be none.

Jacek

On 6 Aug 2016 4:23 a.m., "Mohammed Guller" 
<moham...@glassbeam.com<mailto:moham...@glassbeam.com>> wrote:
Assume the batch interval is 10 seconds and batch processing time is 30 
seconds. So while Spark Streaming is processing the first batch, the receiver 
will have a backlog of 20 seconds worth of data. By the time Spark Streaming 
finishes batch #2, the receiver will have 40 seconds worth of data in memory 
buffer. This backlog will keep growing as time passes assuming data streams in 
consistently at the same rate.

Also keep in mind that windowing operations on a DStream implicitly persist 
every RDD in a DStream in memory.

Mohammed

-Original Message-
From: Jacek Laskowski [mailto:ja...@japila.pl<mailto:ja...@japila.pl>]
Sent: Thursday, August 4, 2016 4:25 PM
To: Mohammed Guller
Cc: Saurav Sinha; user
Subject: Re: Explanation regarding Spark Streaming

On Fri, Aug 5, 2016 at 12:48 AM, Mohammed Guller 
<moham...@glassbeam.com<mailto:moham...@glassbeam.com>> wrote:
> and eventually you will run out of memory.

Why? Mind elaborating?

Jacek


RE: Explanation regarding Spark Streaming

2016-08-05 Thread Mohammed Guller
Assume the batch interval is 10 seconds and batch processing time is 30 
seconds. So while Spark Streaming is processing the first batch, the receiver 
will have a backlog of 20 seconds worth of data. By the time Spark Streaming 
finishes batch #2, the receiver will have 40 seconds worth of data in memory 
buffer. This backlog will keep growing as time passes assuming data streams in 
consistently at the same rate.

Also keep in mind that windowing operations on a DStream implicitly persist 
every RDD in a DStream in memory.
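
As an aside, a hedged sketch of the standard rate-limiting settings that are normally 
used to keep such a backlog bounded (the values here are arbitrary examples):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("rate-limited-streaming")
  .set("spark.streaming.backpressure.enabled", "true")  // adapt receiving rate to processing speed (Spark 1.5+)
  .set("spark.streaming.receiver.maxRate", "10000")     // hard cap: max records/sec per receiver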

Mohammed

-Original Message-
From: Jacek Laskowski [mailto:ja...@japila.pl] 
Sent: Thursday, August 4, 2016 4:25 PM
To: Mohammed Guller
Cc: Saurav Sinha; user
Subject: Re: Explanation regarding Spark Streaming

On Fri, Aug 5, 2016 at 12:48 AM, Mohammed Guller <moham...@glassbeam.com> wrote:
> and eventually you will run out of memory.

Why? Mind elaborating?

Jacek




RE: Explanation regarding Spark Streaming

2016-08-04 Thread Mohammed Guller
The backlog will increase as time passes and eventually you will run out of 
memory.

Mohammed
Author: Big Data Analytics with 
Spark

From: Saurav Sinha [mailto:sauravsinh...@gmail.com]
Sent: Wednesday, August 3, 2016 11:57 PM
To: user
Subject: Explanation regarding Spark Streaming

Hi,

I have a query.

Q1. What will happen if a Spark Streaming job has a batch duration of 60 sec and the 
processing time of the complete pipeline is greater than 60 sec?

--
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


RE: Spark SQL driver memory keeps rising

2016-06-16 Thread Mohammed Guller
I haven’t read the code yet, but when you invoke spark-submit, where are you 
specifying --master yarn --deploy-mode client? Is it in the default config file 
and are you sure that spark-submit is reading the right file?

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Khaled Hammouda [mailto:khaled.hammo...@kik.com]
Sent: Thursday, June 16, 2016 11:45 AM
To: Mohammed Guller
Cc: user
Subject: Re: Spark SQL driver memory keeps rising

I'm using pyspark and running in YARN client mode. I managed to anonymize the 
code a bit and pasted it below.

You'll notice that I don't collect any output in the driver, instead the data 
is written to parquet directly. Also notice that I increased 
spark.driver.maxResultSize to 10g because the job was failing with the error 
"serialized results of x tasks () is bigger than spark.driver.maxResultSize 
(xxx)", which means the tasks were sending something to the driver, and that's 
probably what's causing the driver memory usage to keep rising. This happens at 
stages that read/write shuffled data (as opposed to input stages).

I also noticed this message appear many times in the output "INFO: 
MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 
4 to ip-xx-xx-xx-xx.ec2.internal:49490". I'm not sure if it's relevant.

Cluster setup:
- AWS EMR 4.7.1 using Spark 1.6.1
- 200 nodes (r3.2xlarge, 61 GB memory), and the master node is r3.4xlarge (122 
GB memory)
- 1.4 TB json logs split across ~70k files in S3


# spark-submit --driver-memory 100g --executor-memory 44g --executor-cores 8 \
#   --conf spark.yarn.executor.memoryOverhead=9000 --num-executors 199 log_cleansing.py

conf = SparkConf()
conf.setAppName('log_cleansing') \
    .set("spark.driver.maxResultSize", "10g") \
    .set("spark.hadoop.spark.sql.parquet.output.committer.class",
         "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter") \
    .set("spark.hadoop.mapred.output.committer.class",
         "org.apache.hadoop.mapred.DirectFileOutputCommitter") \
    .set("spark.hadoop.mapreduce.use.directfileoutputcommitter", "true")
sc = SparkContext(conf=conf)

sqlContext = SQLContext(sc)
sqlContext.setConf("spark.sql.shuffle.partitions", "3600")

# raw data is stored in s3 and is split across 10s of thousands of files
# (each file is about 20MB, total data set size is around 1.4 TB)
# so we coalesce to a manageable number of partitions and store it in hdfs
text = sc.textFile("s3://json_logs/today/*/*").coalesce(400)
text.saveAsTextFile("hdfs:///json_logs/today")

# now read the json data set from hdfs
json_logs = sqlContext.read.json("hdfs:///json_logs/today")
sqlContext.registerDataFrameAsTable(json_logs, "json_logs")

# this cleans the date and only pulls in events within two days of their
# ingestion ts, and eliminates pure duplicates
clean_date_and_gap = sqlContext.sql(
 '''SELECT DISTINCT t.*
FROM json_logs t
WHERE datediff(to_utc_timestamp(`ingestionTime`, 'UTC'), 
to_utc_timestamp(`timestamp`, 'UTC')) < 2
  AND to_utc_timestamp(`timestamp`, 'UTC') IS NOT NULL''')
sqlContext.registerDataFrameAsTable(clean_date_and_gap, "clean_date_and_gap")
clean_date_and_gap.cache()

# extract unique event instanceIds and their min ts
distinct_intanceIds = sqlContext.sql(
 '''SELECT
  instanceId,
  min(to_utc_timestamp(`ingestionTime`, 'UTC')) min_ingestion_ts
FROM clean_date_and_gap
GROUP BY instanceId''')
sqlContext.registerDataFrameAsTable(distinct_intanceIds, "distinct_intanceIds")

# create table with only the min ingestion ts records
deduped_day = sqlContext.sql(
 '''SELECT t.*
FROM clean_date_and_gap t
JOIN distinct_intanceIds d
  ON (d.instanceId = t.instanceId AND
  to_utc_timestamp(`ingestionTime`, 'UTC') = d.min_ingestion_ts)''')

# drop weird columns
deduped_day = deduped_day.drop('weird column name1') \
 .drop('weird column name2') \
 .drop('weird column name3')

# standardize all column names so that Parquet is happy
# rename() adds clean alias names to hide problematic column names
oldCols = deduped_day.schema.names
deduped_day_1 = deduped_day.select(*(rename(oldCols)))

sqlContext.registerDataFrameAsTable(deduped_day_1, 'deduped_day_1')
deduped_day_1.cache()

clean_date_and_gap.unpersist()

# load prior days isntanceids for deduping purpose
last1_instanceIds = 
sqlContext.read.parquet("s3://unique_instanceids/today_minus_1day/*")
last2_instanceIds = 
sqlContext.read.parquet("s3://unique_instanceids/today_minus_2day/*")
prior_instanceIds = 
last1_instanceIds.unionAll(last2_instanceIds).distinct().cache()
sqlContext.registerDataFrameAsTable(prior_instanceIds, 'prior_instanceIds')

# determine

RE: Spark SQL driver memory keeps rising

2016-06-15 Thread Mohammed Guller
It would be hard to guess what could be going on without looking at the code. 
It looks like the driver program goes into a long stop-the-world GC pause. This 
should not happen on the machine running the driver program if all that you are doing 
is reading data from HDFS, performing a bunch of transformations, and writing the 
result back into HDFS.

Perhaps, the program is not actually using Spark in cluster mode, but running 
Spark in local mode?

Mohammed
Author: Big Data Analytics with 
Spark

From: Khaled Hammouda [mailto:khaled.hammo...@kik.com]
Sent: Tuesday, June 14, 2016 10:23 PM
To: user
Subject: Spark SQL driver memory keeps rising

I'm having trouble with a Spark SQL job in which I run a series of SQL 
transformations on data loaded from HDFS.

The first two stages load data from the HDFS input without issues, but later stages 
that require shuffles cause the driver memory to keep rising until it is exhausted; 
then the driver stalls, the Spark UI stops responding, and I can't even kill the driver 
with ^C, I have to forcibly kill the process.

I think I'm allocating enough memory to the driver: driver memory is 44 GB, and 
spark.driver.memoryOverhead is 4.5 GB. When I look at the memory usage, the 
driver memory before the shuffle starts is at about 2.4 GB (virtual mem size 
for the driver process is about 50 GB), and then once the stages that require 
shuffle start I can see the driver memory rising fast to about 47 GB, then 
everything stops responding.

I'm not invoking any output operation that collects data at the driver. I just 
call .cache() on a couple of dataframes since they get used more than once in 
the SQL transformations, but those should be cached on the workers. Then I 
write the final result to a parquet file, but the job doesn't get to this final 
stage.

What could possibly be causing the driver memory to rise that fast when no data 
is being collected at the driver?

Thanks,
Khaled


RE: concat spark dataframes

2016-06-15 Thread Mohammed Guller
Top of my head, I can think of the zip operation that RDD provides. So for 
example, if you have two DataFrames df1 and df2, you could do something like 
this:

val newDF = df1.rdd.zip(df2.rdd).map { case(rowFromDf1, rowFromDf2) => 
()}.toDF(...)

Couple of things to keep in mind:

1)  Both df1 and df2 should have the same number of rows.

2)  You are assuming that row N from df1 is related to row N from df2.
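
To make that concrete, a fuller sketch of the zip approach (Spark 1.x API; df1, df2 and 
sqlContext are assumed to already exist, with no overlapping column names):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// Pair row N of df1 with row N of df2 and merge their fields into a single Row.
val zippedRows = df1.rdd.zip(df2.rdd).map {
  case (rowFromDf1, rowFromDf2) => Row.fromSeq(rowFromDf1.toSeq ++ rowFromDf2.toSeq)
}

// The combined schema is simply the fields of df1 followed by the fields of df2.
val combinedSchema = StructType(df1.schema.fields ++ df2.schema.fields)
val newDF = sqlContext.createDataFrame(zippedRows, combinedSchema)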

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: spR [mailto:data.smar...@gmail.com]
Sent: Wednesday, June 15, 2016 4:08 PM
To: Mohammed Guller
Cc: Natu Lauchande; user
Subject: Re: concat spark dataframes

Hey,

There are quite a lot of fields. But, there are no common fields between the 2 
dataframes. Can I not concatenate the 2 frames like we can do in pandas such 
that the resulting dataframe has columns from both the dataframes?

Thank you.

Regards,
Misha



On Wed, Jun 15, 2016 at 3:44 PM, Mohammed Guller 
<moham...@glassbeam.com<mailto:moham...@glassbeam.com>> wrote:
Hi Misha,
What is the schema for both the DataFrames? And what is the expected schema of 
the resulting DataFrame?

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Natu Lauchande [mailto:nlaucha...@gmail.com<mailto:nlaucha...@gmail.com>]
Sent: Wednesday, June 15, 2016 2:07 PM
To: spR
Cc: user
Subject: Re: concat spark dataframes

Hi,
You can select the common columns and use DataFrame.unionAll.
Regards,
Natu

On Wed, Jun 15, 2016 at 8:57 PM, spR 
<data.smar...@gmail.com<mailto:data.smar...@gmail.com>> wrote:
hi,

how to concatenate spark dataframes? I have 2 frames with certain columns. I 
want to get a dataframe with columns from both the other frames.

Regards,
Misha




RE: Spark 2.0 release date

2016-06-15 Thread Mohammed Guller
Andy – instead of Naïve Bayes, you should have used the Multi-layer Perceptron 
classifier ☺

Mohammed
Author: Big Data Analytics with 
Spark

From: andy petrella [mailto:andy.petre...@gmail.com]
Sent: Wednesday, June 15, 2016 7:57 AM
To: Ted Yu
Cc: Mich Talebzadeh; Chaturvedi Chola; user @spark
Subject: Re: Spark 2.0 release date

Yeah well... the prior was high... but don't have enough data on Mich to have 
an accurate likelihood :-)
But ok, my bad, I continue with the preview stuff and leave this thread in 
peace ^^
tx ted
cheers

On Wed, Jun 15, 2016 at 4:47 PM Ted Yu 
> wrote:
Andy:
You should sense the tone in Mich's response.

To my knowledge, there hasn't been an RC for the 2.0 release yet.
Once we have an RC, it goes through the normal voting process.

FYI

On Wed, Jun 15, 2016 at 7:38 AM, andy petrella 
> wrote:
> tomorrow lunch time
Which TZ :-) → I'm working on the update of some materials that Dean Wampler and I 
will give tomorrow at Scala Days (well, tomorrow CEST).

Hence, I'm upgrading the materials on spark 2.0.0-preview, do you think 2.0.0 
will be released before 6PM CEST (9AM PDT)? I don't want to be a joke in front 
of the audience with my almost cutting edge version :-P

tx


On Wed, Jun 15, 2016 at 3:59 PM Mich Talebzadeh 
> wrote:
Tomorrow lunchtime.

Btw can you stop spamming every big data forum about good interview questions 
book for big data!

I have seen your mails on this big data book in spark, hive and tez forums and 
I am sure there are many others. That seems to be the only mail you send around.

This forum is for technical discussions not for promotional material. Please 
confine yourself to technical matters






Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 15 June 2016 at 12:45, Chaturvedi Chola 
> wrote:
when is the spark 2.0 release planned

--
andy

--
andy


RE: concat spark dataframes

2016-06-15 Thread Mohammed Guller
Hi Misha,
What is the schema for both the DataFrames? And what is the expected schema of 
the resulting DataFrame?

Mohammed
Author: Big Data Analytics with 
Spark

From: Natu Lauchande [mailto:nlaucha...@gmail.com]
Sent: Wednesday, June 15, 2016 2:07 PM
To: spR
Cc: user
Subject: Re: concat spark dataframes

Hi,
You can select the common columns and use DataFrame.unionAll.
Regards,
Natu

On Wed, Jun 15, 2016 at 8:57 PM, spR 
> wrote:
hi,

how to concatenate spark dataframes? I have 2 frames with certain columns. I 
want to get a dataframe with columns from both the other frames.

Regards,
Misha



RE: JDBC Dialect for saving DataFrame into Vertica Table

2016-05-26 Thread Mohammed Guller
Vertica also provides a Spark connector. It was not GA the last time I looked 
at it, but available on the Vertica community site. Have you tried using the 
Vertica Spark connector instead of the JDBC driver?

Mohammed
Author: Big Data Analytics with 
Spark

From: Aaron Ilovici [mailto:ailov...@wayfair.com]
Sent: Thursday, May 26, 2016 8:08 AM
To: u...@spark.apache.org; dev@spark.apache.org
Subject: JDBC Dialect for saving DataFrame into Vertica Table

I am attempting to write a DataFrame of Rows to Vertica via DataFrameWriter's 
jdbc function in the following manner:

dataframe.write().mode(SaveMode.Append).jdbc(url, table, properties);

This works when there are no NULL values in any of the Rows in my DataFrame. However, 
when there are rows with NULLs, I get the following error:

ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 24)
java.sql.SQLFeatureNotSupportedException: [Vertica][JDBC](10220) Driver not 
capable.
at com.vertica.exceptions.ExceptionConverter.toSQLException(Unknown Source)
at com.vertica.jdbc.common.SPreparedStatement.checkTypeSupported(Unknown 
Source)
at com.vertica.jdbc.common.SPreparedStatement.setNull(Unknown Source)

This appears to be Spark's attempt to set a null value in a PreparedStatement, 
but Vertica does not understand the type upon executing the transaction. I see 
in JdbcDialects.scala that there are dialects for MySQL, Postgres, DB2, 
MsSQLServer, Derby, and Oracle.

1 - Would writing a dialect for Vertica alleviate the issue, by setting a 'NULL' 
in a type that Vertica would understand?
2 - What would be the best way to do this without a Spark patch? Scala, Java, 
make a jar and call 'JdbcDialects.registerDialect(VerticaDialect)' once created?
3 - Where would one find the proper mapping between Spark DataTypes and Vertica 
DataTypes? I don't see 'NULL' handling for any of the dialects, only the base 
case 'case _ => None' - is None mapped to the proper NULL type elsewhere?

My environment: Spark 1.6, Vertica Driver 7.2.2, Java 1.7

I would be happy to create a Jira and submit a pull request with the 
VerticaDialect once I figure this out.
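
For reference, a rough sketch of what question 2 might look like in practice (the URL 
prefix and type mappings below are illustrative assumptions, not a tested Vertica 
dialect):

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

object VerticaDialect extends JdbcDialect {
  // Claim JDBC URLs that look like Vertica connections (prefix is an assumption).
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:vertica")

  // Map Spark types to database type definitions; anything unmapped falls back
  // to Spark's default behaviour.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("VARCHAR(65000)", java.sql.Types.VARCHAR))
    case BooleanType => Some(JdbcType("BOOLEAN", java.sql.Types.BOOLEAN))
    case _           => None
  }
}

// Register once per application, before calling DataFrameWriter.jdbc().
JdbcDialects.registerDialect(VerticaDialect)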

Thank you for any insight on this,

AARON ILOVICI
Software Engineer
Marketing Engineering


WAYFAIR
4 Copley Place
Boston, MA 02116
(617) 532-6100 x1231
ailov...@wayfair.com






RE: Accessing Cassandra data from Spark Shell

2016-05-18 Thread Mohammed Guller
As Ben mentioned, Spark 1.5.2 does work with C*.  Make sure that you are using 
the correct version of the Spark Cassandra Connector.


Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Ben Slater [mailto:ben.sla...@instaclustr.com]
Sent: Tuesday, May 17, 2016 11:00 PM
To: user@cassandra.apache.org; Mohammed Guller
Cc: user
Subject: Re: Accessing Cassandra data from Spark Shell

It definitely should be possible for 1.5.2 (I have used it with spark-shell and 
cassandra connector with 1.4.x). The main trick is in lining up all the 
versions and building an appropriate connector jar.

Cheers
Ben

On Wed, 18 May 2016 at 15:40 Cassa L 
<lcas...@gmail.com<mailto:lcas...@gmail.com>> wrote:
Hi,
I followed instructions to run SparkShell with Spark-1.6. It works fine. 
However, I need to use spark-1.5.2 version. With it, it does not work. I keep 
getting NoSuchMethod Errors. Is there any issue running Spark Shell for 
Cassandra using older version of Spark?


Regards,
LCassa

On Tue, May 10, 2016 at 6:48 PM, Mohammed Guller 
<moham...@glassbeam.com<mailto:moham...@glassbeam.com>> wrote:
Yes, it is very simple to access Cassandra data using Spark shell.

Step 1: Launch the spark-shell with the spark-cassandra-connector package
$SPARK_HOME/bin/spark-shell --packages 
com.datastax.spark:spark-cassandra-connector_2.10:1.5.0

Step 2: Create a DataFrame pointing to your Cassandra table
val dfCassTable = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "your_column_family", "keyspace" -> "your_keyspace"))
  .load()

From this point onward, you have complete access to the DataFrame API. You can 
even register it as a temporary table, if you would prefer to use SQL/HiveQL.
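
For example, a quick sketch of the SQL route (the temporary table name is arbitrary):

dfCassTable.registerTempTable("cass_table")
val top10 = sqlContext.sql("SELECT * FROM cass_table LIMIT 10")
top10.show()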

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Ben Slater 
[mailto:ben.sla...@instaclustr.com<mailto:ben.sla...@instaclustr.com>]
Sent: Monday, May 9, 2016 9:28 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>; user
Subject: Re: Accessing Cassandra data from Spark Shell

You can use SparkShell to access Cassandra via the Spark Cassandra connector. 
The getting started article on our support page will probably give you a good 
steer to get started even if you’re not using Instaclustr: 
https://support.instaclustr.com/hc/en-us/articles/213097877-Getting-Started-with-Instaclustr-Spark-Cassandra-

Cheers
Ben

On Tue, 10 May 2016 at 14:08 Cassa L 
<lcas...@gmail.com<mailto:lcas...@gmail.com>> wrote:
Hi,
Has anyone tried accessing Cassandra data using SparkShell? How do you do it? 
Can you use HiveContext for Cassandra data? I'm using community version of 
Cassandra-3.0

Thanks,
LCassa
--

Ben Slater
Chief Product Officer, Instaclustr
+61 437 929 798

--

Ben Slater
Chief Product Officer, Instaclustr
+61 437 929 798



RE: Accessing Cassandra data from Spark Shell

2016-05-10 Thread Mohammed Guller
Yes, it is very simple to access Cassandra data using Spark shell.

Step 1: Launch the spark-shell with the spark-cassandra-connector package
$SPARK_HOME/bin/spark-shell --packages 
com.datastax.spark:spark-cassandra-connector_2.10:1.5.0

Step 2: Create a DataFrame pointing to your Cassandra table
val dfCassTable = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "your_column_family", "keyspace" -> "your_keyspace"))
  .load()

From this point onward, you have complete access to the DataFrame API. You can 
even register it as a temporary table, if you would prefer to use SQL/HiveQL.

Mohammed
Author: Big Data Analytics with 
Spark

From: Ben Slater [mailto:ben.sla...@instaclustr.com]
Sent: Monday, May 9, 2016 9:28 PM
To: user@cassandra.apache.org; user
Subject: Re: Accessing Cassandra data from Spark Shell

You can use SparkShell to access Cassandra via the Spark Cassandra connector. 
The getting started article on our support page will probably give you a good 
steer to get started even if you’re not using Instaclustr: 
https://support.instaclustr.com/hc/en-us/articles/213097877-Getting-Started-with-Instaclustr-Spark-Cassandra-

Cheers
Ben

On Tue, 10 May 2016 at 14:08 Cassa L 
> wrote:
Hi,
Has anyone tried accessing Cassandra data using SparkShell? How do you do it? 
Can you use HiveContext for Cassandra data? I'm using community version of 
Cassandra-3.0

Thanks,
LCassa
--

Ben Slater
Chief Product Officer, Instaclustr
+61 437 929 798



RE: Reading table schema from Cassandra

2016-05-10 Thread Mohammed Guller
You can create a DataFrame directly from a Cassandra table using something like 
this:

val dfCassTable = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "your_column_family", "keyspace" -> "your_keyspace"))
  .load()

Then, you can get schema:
val dfCassTableSchema = dfCassTable.schema
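
And since the goal is roll-up tables, a hedged sketch of a minute-level aggregation 
(the column names event_ts, holding a Unix timestamp in seconds, and value are 
assumptions for illustration only):

import org.apache.spark.sql.functions._

val perMinute = dfCassTable
  .withColumn("minute", (col("event_ts") / 60).cast("long") * 60)  // truncate to the minute
  .groupBy("minute")
  .agg(sum(col("value")).as("value_sum"))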

Mohammed
Author: Big Data Analytics with Spark


-Original Message-
From: justneeraj [mailto:justnee...@gmail.com] 
Sent: Tuesday, May 10, 2016 2:22 AM
To: user@spark.apache.org
Subject: Reading table schema from Cassandra

Hi,

We are using the Spark Cassandra connector for our app.

I am trying to create higher-level roll-up tables, e.g. a minutes table from a seconds 
table.

If my tables are already defined, how can I read the schema of a table, so that I can 
load the data into a DataFrame and create the aggregates?

Any help will be really appreciated.

Thanks,
Neeraj 






RE: Spark standalone workers, executors and JVMs

2016-05-04 Thread Mohammed Guller
Spark allows you configure the resources for the worker process. If I remember 
it correctly, you can use SPARK_DAEMON_MEMORY to control memory allocated to 
the worker process.

#1 below is more appropriate if you will be running just one application at a 
time. 32GB heap size is still too high. Depending on the garbage collector, you 
may see long pauses.

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Simone Franzini [mailto:captainfr...@gmail.com]
Sent: Wednesday, May 4, 2016 7:40 AM
To: user
Subject: Re: Spark standalone workers, executors and JVMs

Hi Mohammed,

Thanks for your reply. I agree with you, however a single application can use 
multiple executors as well, so I am still not clear which option is best. Let 
me make an example to be a little more concrete.

Let's say I am only running a single application. Let's assume again that I 
have 192GB of memory and 24 cores on each node. Which one of the following two 
options is best and why:
1. Running 6 workers with 32GB each and 1 executor/worker (i.e. set 
SPARK_WORKER_INSTANCES=6, leave spark.executor.cores to its default, which is 
to assign all available cores to an executor in standalone mode).
2. Running 1 worker with 192GB memory and 6 executors/worker (i.e. 
SPARK_WORKER_INSTANCES=1 and spark.executor.cores=5, 
spark.executor.memory=32GB).

Also, one more question: I understand that workers and executors are different 
processes. How many resources does the worker process actually use, and how do I set 
them? As far as I understand, the worker does not need many resources, as it only 
spawns executors. Is that correct?

Thanks,
Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

On Mon, May 2, 2016 at 7:47 PM, Mohammed Guller 
<moham...@glassbeam.com<mailto:moham...@glassbeam.com>> wrote:
The workers and executors run as separate JVM processes in the standalone mode.

The use of multiple workers on a single machine depends on how you will be using the 
cluster. If you run multiple Spark applications simultaneously, each application gets 
its own executors. So, for example, if you allocate 8GB to each application, you can 
run 192/8 = 24 Spark applications simultaneously (assuming you also have a large number 
of cores). Each executor has only an 8GB heap, so GC should not be an issue. 
Alternatively, if you know that you will have few applications running simultaneously 
on that cluster, running multiple workers on each machine will allow you to avoid GC 
issues associated with allocating a large heap to a single JVM process. This option 
allows you to run multiple executors for an application on a single machine, and each 
executor can be configured with optimal memory.


Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Simone Franzini 
[mailto:captainfr...@gmail.com<mailto:captainfr...@gmail.com>]
Sent: Monday, May 2, 2016 9:27 AM
To: user
Subject: Fwd: Spark standalone workers, executors and JVMs

I am still a little bit confused about workers, executors and JVMs in 
standalone mode.
Are worker processes and executors independent JVMs or do executors run within 
the worker JVM?
I have some memory-rich nodes (192GB) and I would like to avoid deploying 
massive JVMs due to well known performance issues (GC and such).
As of Spark 1.4 it is possible to either deploy multiple workers 
(SPARK_WORKER_INSTANCES + SPARK_WORKER_CORES) or multiple executors per worker 
(--executor-cores). Which option is preferable and why?

Thanks,
Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini




RE: Multiple Spark Applications that use Cassandra, how to share resources/nodes

2016-05-03 Thread Mohammed Guller
You can run multiple Spark applications simultaneously. Just limit the # of 
cores and memory allocated to each application. For example, if each node has 8 
cores and there are 10 nodes and you want to be able to run 4 applications 
simultaneously, limit the # of cores for each application to 20. Similarly, you 
can limit the amount of memory that an application can use on each node.

You can also use dynamic resource allocation.
Details are here: 
http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
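
For the static approach, a minimal sketch of capping an application's resources in its 
SparkConf (the numbers just follow the example above):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("capped-app")
  .set("spark.cores.max", "20")        // total cores this application may claim on the cluster
  .set("spark.executor.memory", "8g")  // memory per executor on each node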

Mohammed
Author: Big Data Analytics with 
Spark

From: Tobias Eriksson [mailto:tobias.eriks...@qvantel.com]
Sent: Tuesday, May 3, 2016 7:34 AM
To: user@spark.apache.org
Subject: Multiple Spark Applications that use Cassandra, how to share 
resources/nodes

Hi
 We are using Spark for a long-running job; in fact, it is a REST server that does 
some joins with some tables in Cassandra and returns the result.
Now we need to have multiple applications running in the same Spark cluster, and from 
what I understand this is not possible, or should I say somewhat complicated:

  1.  A Spark application takes all the resources / nodes in the cluster (we 
have 4 nodes one for each Cassandra Node)
  2.  A Spark application returns it’s resources when it is done (exits or the 
context is closed/returned)
  3.  Sharing resources using Mesos only allows scaling down and then scaling 
up by a step-by-step policy, i.e. 2 nodes, 3 nodes, 4 nodes, … And increases as 
the need increases
But if this is true, I cannot have several applications running in parallel, is that 
true?
If I use Mesos then the whole idea with one Spark Worker per Cassandra Node 
fails, as it talks directly to a node, and that is how it is so efficient.
In this case I need all nodes, not 3 out of 4.

Any mistakes in my thinking ?
Any ideas on how to solve this ? Should be a common problem I think

-Tobias




RE: Spark standalone workers, executors and JVMs

2016-05-02 Thread Mohammed Guller
The workers and executors run as separate JVM processes in the standalone mode.

The use of multiple workers on a single machine depends on how you will be using the 
cluster. If you run multiple Spark applications simultaneously, each application gets 
its own executors. So, for example, if you allocate 8GB to each application, you can 
run 192/8 = 24 Spark applications simultaneously (assuming you also have a large number 
of cores). Each executor has only an 8GB heap, so GC should not be an issue. 
Alternatively, if you know that you will have few applications running simultaneously 
on that cluster, running multiple workers on each machine will allow you to avoid GC 
issues associated with allocating a large heap to a single JVM process. This option 
allows you to run multiple executors for an application on a single machine, and each 
executor can be configured with optimal memory.


Mohammed
Author: Big Data Analytics with 
Spark

From: Simone Franzini [mailto:captainfr...@gmail.com]
Sent: Monday, May 2, 2016 9:27 AM
To: user
Subject: Fwd: Spark standalone workers, executors and JVMs

I am still a little bit confused about workers, executors and JVMs in 
standalone mode.
Are worker processes and executors independent JVMs or do executors run within 
the worker JVM?
I have some memory-rich nodes (192GB) and I would like to avoid deploying 
massive JVMs due to well known performance issues (GC and such).
As of Spark 1.4 it is possible to either deploy multiple workers 
(SPARK_WORKER_INSTANCES + SPARK_WORKER_CORES) or multiple executors per worker 
(--executor-cores). Which option is preferable and why?

Thanks,
Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini



RE: Create tab separated file from a dataframe spark 1.4 with Java

2016-04-21 Thread Mohammed Guller
It should be straightforward to do this using the spark-csv package. Assuming 
“myDF” is your DataFrame, you can use the following code to save data in  a TSV 
file.

myDF.write
.format("com.databricks.spark.csv")
.option("delimiter", "\t")
.save("data.tsv")

Mohammed

From: Mail.com [mailto:pradeep.mi...@mail.com]
Sent: Thursday, April 21, 2016 12:29 PM
To: pradeep.mi...@mail.com
Cc: user@spark.apache.org
Subject: Create tab separated file from a dataframe spark 1.4 with Java

Hi

I have a DataFrame and need to write it to a tab-separated file using Spark 1.4 and 
Java.

Can someone please suggest how?

Thanks,
Pradeep


Request to add a new book to the Books section on Spark's website

2016-03-09 Thread Mohammed Guller
My book on Spark was recently published. I would like to request it to be added 
to the Books section on Spark's website.

Here are the details about the book.

Title: Big Data Analytics with Spark
Author: Mohammed Guller
Link: http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/

Brief Description:
This book is a hands-on guide for learning how to use Spark for different types 
of analytics, including batch, interactive, graph, and stream data analytics as 
well as machine learning. It covers Spark core, Spark SQL, DataFrames, Spark 
Streaming, GraphX, MLlib, and Spark ML. Plenty of examples are provided for the 
readers to practice with.

In addition to covering Spark in depth, the book provides an introduction to 
other big data technologies that are commonly used along with Spark, such as 
HDFS, Parquet, Kafka, Avro, Cassandra, HBase, Mesos, and YARN. The book also 
includes a primer on functional programming and Scala.

Please let me know if you need any other information.

Thanks,
Mohammed





updating the Books section on the Spark documentation page

2016-03-08 Thread Mohammed Guller
Hi -

The Spark documentation page (http://spark.apache.org/documentation.html) has 
links to books covering Spark. What is the process for adding a new book to 
that list?

Thanks,
Mohammed
Author: Big Data Analytics with 
Spark




RE: convert SQL multiple Join in Spark

2016-03-03 Thread Mohammed Guller
Why not use Spark SQL?
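
For example, a rough sketch of the Spark SQL route (Spark 1.x API; the DataFrame 
variable names are assumptions, the table names come from the query quoted below, and 
the temporary #Holdouts table is registered here as "holdouts"):

personMaster.registerTempTable("PERSON_MASTER_VIEW")
personAddressHhKeys.registerTempTable("PERSON_ADDRESS_HH_KEYS")
holdouts.registerTempTable("holdouts")

val result = sqlContext.sql("""
  SELECT a.PERSON_ID AS RETAINED_PERSON_ID, a.PERSONTYPE, b.HHID
  FROM PERSON_MASTER_VIEW a
  INNER JOIN PERSON_ADDRESS_HH_KEYS b ON a.PERSON_ID = b.PERSON_ID
  LEFT JOIN holdouts c ON a.PERSON_ID = c.RETAINED_PERSON_ID
  WHERE c.RETAINED_PERSON_ID IS NULL AND a.PERSONTYPE IS NOT NULL
""")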

Mohammed
Author: Big Data Analytics with 
Spark

From: Vikash Kumar [mailto:vikashsp...@gmail.com]
Sent: Wednesday, March 2, 2016 8:29 PM
To: user@spark.apache.org
Subject: convert SQL multiple Join in Spark


I have to write or convert the SQL query below into Spark/Scala. Can anybody suggest 
how to implement this in Spark?

SELECT a.PERSON_ID as RETAINED_PERSON_ID,
       a.PERSON_ID,
       a.PERSONTYPE,
       'y' as HOLDOUT,
       d.LOCATION,
       b.HHID,
       a.AGE_OUTPUT as AGE,
       a.FIRST_NAME,
       d.STARTDATE,
       d.ENDDATE,
       'Not In Campaign' as HH_TYPE
FROM PERSON_MASTER_VIEW a
INNER JOIN PERSON_ADDRESS_HH_KEYS b
    on a.PERSON_ID = b.PERSON_ID
LEFT JOIN #Holdouts c
    on a.PERSON_ID = c.RETAINED_PERSON_ID
INNER JOIN #Holdouts d
    on b.HHID = d.HHID
WHERE c.RETAINED_PERSON_ID IS NULL and a.PERSONTYPE IS NOT NULL
GROUP BY a.PERSON_ID, a.PERSONTYPE, b.HHID, a.AGE_OUTPUT, a.FIRST_NAME,
         d.LOCATION, d.STARTDATE, d.ENDDATE


RE: Stage contains task of large size

2016-03-03 Thread Mohammed Guller
Just to elaborate more on what Silvio wrote below, check whether you are 
referencing a class or object member variable in a function literal/closure 
passed to one of the RDD methods.
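
For example, a sketch of the usual fix with a broadcast variable (rdd and 
loadLookupTable() are hypothetical placeholders):

// A large driver-side collection captured in a closure is serialized into every
// task; broadcasting it ships it once per executor instead.
val lookup: Map[String, Int] = loadLookupTable()

// Captured directly, `lookup` travels with each task:
//   rdd.map(x => lookup.getOrElse(x, 0))

// Preferred: broadcast once and reference the handle inside the closure.
val lookupBc = sc.broadcast(lookup)
val counts = rdd.map(x => lookupBc.value.getOrElse(x, 0))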

Mohammed
Author: Big Data Analytics with 
Spark

From: Silvio Fiorito [mailto:silvio.fior...@granturing.com]
Sent: Wednesday, March 2, 2016 8:43 PM
To: Bijuna; user
Subject: RE: Stage contains task of large size




One source of this could be more than you intended (or realized) getting 
serialized as part of your operations. What are the transformations you're 
using? Are you referencing local instance variables in your driver app, as part 
of your transformations? You may have a large collection for instance which 
you're using in your transformation that will get serialized and sent to each 
executor. If you do have something like that look to use broadcast variables 
instead.





From: Bijuna
Sent: Wednesday, March 2, 2016 11:20 PM
To: user@spark.apache.org
Subject: Stage contains task of large size


Spark users,

We are running a Spark application in standalone mode. We see warning messages in the 
logs which say:

Stage 46 contains a task of very large size (983 KB). The maximum recommended task 
size is 100 KB.

What is the recommended approach to fix this warning. Please let me know.

Thank you
Bijuna

Sent from my iPad


RE: Update edge weight in graphx

2016-03-01 Thread Mohammed Guller
Like RDDs, Graphs are also immutable.
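
A minimal sketch of the resulting pattern (assuming an existing Graph[Int, Double] 
named graph): since the graph cannot be modified in place, "updating" edge weights 
means deriving a new graph each iteration.

import org.apache.spark.graphx.Graph

// mapEdges returns a new graph whose edge attributes are the updated weights.
val reweighted: Graph[Int, Double] = graph.mapEdges(e => e.attr * 0.9)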

Mohammed
Author: Big Data Analytics with Spark


-Original Message-
From: naveen.marri [mailto:naveenkumarmarri6...@gmail.com] 
Sent: Monday, February 29, 2016 9:11 PM
To: user@spark.apache.org
Subject: Update edge weight in graphx

Hi,

I'm trying to implement an algorithm using GraphX which involves updating edge weights 
during every iteration. The format is [Node]-[Node]--[Weight].
  Ex: 
I checked the GraphX docs but didn't find any way to change the weight of an edge 
within the same RDD. I know RDDs are immutable; is there any way to do this in GraphX?
Also, is there any way to dynamically add vertices and edges to the graph within the 
same RDD?

Regards,
Naveen






RE: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-03-01 Thread Mohammed Guller
I agree that the Spark official documentation is pretty good. However, a book 
also serves a useful purpose. It provides a structured roadmap for learning a 
new technology. Everything is nicely organized for the reader. For somebody who 
has just started learning Spark, the amount of material on the Internet can be 
overwhelming. There are a ton of blogs and presentations on the Internet. A 
beginner could easily spend months reading them and still be lost. If you are 
experienced, it is easy to figure out what to read and what to skip.

I also agree that a book becomes outdated at some point, but not right away. 
For example, a book covering DataFrames and Spark ML is not outdated yet.

Mohammed
Author: Big Data Analytics with 
Spark

From: charles li [mailto:charles.up...@gmail.com]
Sent: Monday, February 29, 2016 1:39 AM
To: Ashok Kumar
Cc: User
Subject: Re: Recommendation for a good book on Spark, beginner to moderate 
knowledge

Since Spark is under active development, any book you use to learn it will be
somewhat outdated.

I would suggest learning it in several ways, as below:


  *   spark official document, trust me, you will go through this for several 
time if you want to learn in well : http://spark.apache.org/
  *   spark summit, lots of videos and slide, high quality : 
https://spark-summit.org/
  *   databricks' blog : https://databricks.com/blog
  *   attend spark meetup : http://www.meetup.com/
  *   try spark 3-party package if needed and convenient : 
http://spark-packages.org/
  *   and I just start to blog my spark learning memo on my blog: 
http://litaotao.github.io

In a word, I think the best way to learn it is: official documentation + Databricks
blog + others' blogs ===>>> your own blog [a tutorial by you, or just a memo of your
learning].

On Mon, Feb 29, 2016 at 4:50 PM, Ashok Kumar 
> wrote:
Thank you all for valuable advice. Much appreciated

Best

On Sunday, 28 February 2016, 21:48, Ashok Kumar 
> wrote:

  Hi Gurus,

Appreciate if you recommend me a good book on Spark or documentation for 
beginner to moderate knowledge

I very much like to skill myself on transformation and action methods.

FYI, I have already looked at examples on the net. However, some of them are not
clear, at least to me.

Warmest regards




--
--
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao


RE: Spark UI standalone "crashes" after an application finishes

2016-02-29 Thread Mohammed Guller
I believe the OP is referring to the application UI on port 4040.

The application UI on port 4040 is available only while the application is running. 
As per the documentation:
To view the web UI after the fact, set spark.eventLog.enabled to true before 
starting the application. This configures Spark to log Spark events that encode 
the information displayed in the UI to persisted storage.
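
For example, something along these lines before creating the SparkContext (the log directory is just an illustration and must be readable by the master/history server):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///spark-event-logs")
val sc = new SparkContext(conf)

The same two settings can also go into conf/spark-defaults.conf. With event logging enabled, the standalone Master can rebuild the UI of a completed application from those logs.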

Mohammed
Author: Big Data Analytics with 
Spark

From: Shixiong(Ryan) Zhu [mailto:shixi...@databricks.com]
Sent: Monday, February 29, 2016 4:03 PM
To: Sumona Routh
Cc: user@spark.apache.org
Subject: Re: Spark UI standalone "crashes" after an application finishes

Do you mean you cannot access Master UI after your application completes? Could 
you check the master log?

On Mon, Feb 29, 2016 at 3:48 PM, Sumona Routh 
> wrote:
Hi there,
I've been doing some performance tuning of our Spark application, which is 
using Spark 1.2.1 standalone. I have been using the spark metrics to graph out 
details as I run the jobs, as well as the UI to review the tasks and stages.
I notice that after my application completes, or is near completion, the UI 
"crashes." I get a Connection Refused response. Sometimes, the page eventually 
recovers and will load again, but sometimes I end up having to restart the 
Spark master to get it back. When I look at my graphs on the app, the memory 
consumption (of driver, executors, and what I believe to be the daemon 
(spark.jvm.total.used)) appears to be healthy. Monitoring the master machine 
itself, memory and CPU appear healthy as well.
Has anyone else seen this issue? Are there logs for the UI itself, and where 
might I find those?
Thanks!
Sumona



RE: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Mohammed Guller
Hi Ashok,

Another book recommendation (I am the author): “Big Data Analytics with Spark”

The first half of the book is specifically written for people just getting 
started with Big Data and Spark.

Mohammed
Author: Big Data Analytics with 
Spark

From: Suhaas Lang [mailto:suhaas.l...@gmail.com]
Sent: Sunday, February 28, 2016 6:21 PM
To: Jules Damji
Cc: Ashok Kumar; User
Subject: Re: Recommendation for a good book on Spark, beginner to moderate 
knowledge


Thanks, Jules!
On Feb 28, 2016 7:47 PM, "Jules Damji" 
> wrote:
Suhass,

When I referred to interactive shells, I was referring to the Scala & Python
interactive language shells. Both Python & Scala come with their respective
interactive shells. By just typing “python” or “scala” (assuming the installation
bin directory is in your $PATH), it will fire up the shell.

As for the “pyspark” and “spark-shell”, they both come with the Spark 
installation and are in $spark_install_dir/bin directory.

Have a go at them. Best way to learn the language.

Cheers
Jules

--
“Language is the palate from which we draw all colors of our life.”
Jules Damji
dmat...@comcast.net




On Feb 28, 2016, at 4:08 PM, Suhaas Lang 
> wrote:


Jules,

Could you please post links to these interactive shells for Python and Scala?
On Feb 28, 2016 5:32 PM, "Jules Damji" 
> wrote:
Hello Ashoka,

"Learning Spark," from O'Reilly, is certainly a good start, and all basic video 
tutorials from Spark Summit Training, "Spark Essentials", are excellent 
supplementary materials.

And the best (and most effective) way to teach yourself is really firing up the 
spark-shell or pyspark and doing it yourself—immersing yourself by trying all 
basic transformations and actions on RDDs, with contrived small data sets.

I've found that learning Scala & Python through their interactive shells, where
feedback is immediate and response is quick, is the best learning experience.

Same is true for Scala or Python Notebooks interacting with a Spark, running in 
local or cluster mode.

Cheers,

Jules

Sent from my iPhone
Pardon the dumb thumb typos :)

On Feb 28, 2016, at 1:48 PM, Ashok Kumar 
> wrote:
  Hi Gurus,

Appreciate if you recommend me a good book on Spark or documentation for 
beginner to moderate knowledge

I very much like to skill myself on transformation and action methods.

FYI, I have already looked at examples on the net. However, some of them are not
clear, at least to me.

Warmest regards



RE: Get all vertexes with outDegree equals to 0 with GraphX

2016-02-27 Thread Mohammed Guller
Perhaps, the documentation of the filter method would help. Here is the method 
signature (copied from the API doc)

def filter[VD2, ED2](
    preprocess: (Graph[VD, ED]) => Graph[VD2, ED2],
    epred: (EdgeTriplet[VD2, ED2]) => Boolean = (x: EdgeTriplet[VD2, ED2]) => true,
    vpred: (VertexId, VD2) => Boolean = (v: VertexId, d: VD2) => true)
This method returns a subgraph of the original graph. The data in the original
graph remains unchanged. Brief description of the arguments:

VD2:         vertex type that vpred operates on
ED2:         edge type that epred operates on
preprocess:  a function to compute new vertex and edge data before filtering
epred:       edge predicate to filter on after preprocess
vpred:       vertex predicate to filter on after preprocess

In the solution below, the first function literal is the preprocess argument. 
The vpred argument is passed as named argument since we are using the default 
value for epred.

HTH.

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Guillermo Ortiz [mailto:konstt2...@gmail.com]
Sent: Saturday, February 27, 2016 6:17 AM
To: Mohammed Guller
Cc: Robin East; user
Subject: Re: Get all vertexes with outDegree equals to 0 with GraphX

Thank you, I have to think about what the code does, because I am a bit of a noob
in Scala and it's hard for me to understand.

2016-02-27 3:53 GMT+01:00 Mohammed Guller 
<moham...@glassbeam.com<mailto:moham...@glassbeam.com>>:
Here is another solution (minGraph is the graph from your code. I assume that 
is your original graph):

val graphWithNoOutEdges = minGraph.filter(
  graph => graph.outerJoinVertices(graph.outDegrees) {(vId, vData, 
outDegreesOpt) => outDegreesOpt.getOrElse(0)},
  vpred = (vId: VertexId, vOutDegrees: Int) => vOutDegrees == 0
)

val verticesWithNoOutEdges = graphWithNoOutEdges.vertices

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Guillermo Ortiz [mailto:konstt2...@gmail.com<mailto:konstt2...@gmail.com>]
Sent: Friday, February 26, 2016 5:46 AM
To: Robin East
Cc: user
Subject: Re: Get all vertexes with outDegree equals to 0 with GraphX

Yes, I am not really happy with that "collect".
I was looking at using the subgraph method and other options and didn't figure
out anything easy or direct.

I'm going to try your idea.

2016-02-26 14:16 GMT+01:00 Robin East 
<robin.e...@xense.co.uk<mailto:robin.e...@xense.co.uk>>:
Whilst I can think of other ways to do it I don’t think they would be 
conceptually or syntactically any simpler. GraphX doesn’t have the concept of 
built-in vertex properties which would make this simpler - a vertex in GraphX 
is a Vertex ID (Long) and a bunch of custom attributes that you assign. This 
means you have to find a way of ‘pushing’ the vertex degree into the graph so 
you can do comparisons (cf a join in relational databases) or as you have done 
create a list and filter against that (cf filtering against a sub-query in 
relational database).

One thing I would point out is that you probably want to avoid 
finalVertexes.collect() for a large-scale system - this will pull all the 
vertices into the driver and then push them out to the executors again as part 
of the filter operation. A better strategy for large graphs would be:

1. build a graph based on the existing graph where the vertex attribute is the 
vertex degree - the GraphX documentation shows how to do this
2. filter this “degrees” graph to just give you 0 degree vertices
3 use graph.mask passing in the 0-degree graph to get the original graph with 
just 0 degree vertices

Just one variation on several possibilities, the key point is that everything 
is just a graph transformation until you call an action on the resulting graph
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action




On 26 Feb 2016, at 11:59, Guillermo Ortiz 
<konstt2...@gmail.com<mailto:konstt2...@gmail.com>> wrote:

I'm new to GraphX. I need to get the vertices without out edges.
I guess that it's pretty easy, but I did it in a pretty complicated and inefficient way.


val vertices: RDD[(VertexId, (List[String], List[String]))] =
  sc.parallelize(Array((1L, (List("a"), List[String]())),
(2L, (List("b"), List[String]())),
(3L, (List("c"), List[String]())),
(4L, (List("d"), List[String]())),
(5L, (List("e"), List[String]())),
(6L, (List("f"), List[String]()

// Create an RDD for edges
val relationships: RDD[Edge[Boolean]] =
  sc.parallelize(Array(Edge(1L, 2L, true), Edge(2L, 3L, true), Edge(3L, 4L, 
true), Edge(5L, 2L, true)))

val out = minGraph.outDegrees.map(vertex => vertex._1)

RE: Standalone vs. Mesos for production installation on a smallish cluster

2016-02-26 Thread Mohammed Guller
I think you may be referring to Spark Survey 2015. According to that survey, 
48% use standalone, 40% use YARN and only 11% use Mesos (the numbers don’t add 
up to 100 – probably because of rounding error).

Mohammed
Author: Big Data Analytics with 
Spark

From: Igor Berman [mailto:igor.ber...@gmail.com]
Sent: Friday, February 26, 2016 3:52 AM
To: Petr Novak
Cc: user
Subject: Re: Standalone vs. Mesos for production installation on a smallish 
cluster

IMHO most production clusters are standalone.
There was some presentation from Spark Summit with some stats inside (can't find it
right now); standalone was in 1st place. It was from Matei.
https://databricks.com/resources/slides

On 26 February 2016 at 13:40, Petr Novak 
> wrote:
Hi all,
I believe that it used to be in documentation that Standalone mode is not for 
production. I'm either wrong or it was already removed.

Having a small cluster between 5-10 nodes is Standalone recommended for 
production? I would like to go with Mesos but the question is if there is real 
add-on value for production, mainly from stability perspective.

Can I expect that adding Mesos will improve stability compared to Standalone to 
the extent to justify itself according to somewhat increased complexity?

I know it is hard to answer because Mesos layer itself is going to add some 
bugs as well.

Are there unique features enabled by Mesos specific to Spark? E.g. adaptive 
resources for jobs or whatever?

In the future once cluster will grow and more services running on Mesos, we 
plan to use Mesos. The question is if it does worth to go with it immediately 
even maybe its utility is not directly needed at this point.

Many thanks,
Petr



RE: Get all vertexes with outDegree equals to 0 with GraphX

2016-02-26 Thread Mohammed Guller
Here is another solution (minGraph is the graph from your code. I assume that 
is your original graph):

val graphWithNoOutEdges = minGraph.filter(
  graph => graph.outerJoinVertices(graph.outDegrees) {(vId, vData, 
outDegreesOpt) => outDegreesOpt.getOrElse(0)},
  vpred = (vId: VertexId, vOutDegrees: Int) => vOutDegrees == 0
)

val verticesWithNoOutEdges = graphWithNoOutEdges.vertices

Mohammed
Author: Big Data Analytics with 
Spark

From: Guillermo Ortiz [mailto:konstt2...@gmail.com]
Sent: Friday, February 26, 2016 5:46 AM
To: Robin East
Cc: user
Subject: Re: Get all vertexes with outDegree equals to 0 with GraphX

Yes, I am not really happy with that "collect".
I was looking at using the subgraph method and other options and didn't figure
out anything easy or direct.

I'm going to try your idea.

2016-02-26 14:16 GMT+01:00 Robin East 
>:
Whilst I can think of other ways to do it I don’t think they would be 
conceptually or syntactically any simpler. GraphX doesn’t have the concept of 
built-in vertex properties which would make this simpler - a vertex in GraphX 
is a Vertex ID (Long) and a bunch of custom attributes that you assign. This 
means you have to find a way of ‘pushing’ the vertex degree into the graph so 
you can do comparisons (cf a join in relational databases) or as you have done 
create a list and filter against that (cf filtering against a sub-query in 
relational database).

One thing I would point out is that you probably want to avoid 
finalVertexes.collect() for a large-scale system - this will pull all the 
vertices into the driver and then push them out to the executors again as part 
of the filter operation. A better strategy for large graphs would be:

1. build a graph based on the existing graph where the vertex attribute is the 
vertex degree - the GraphX documentation shows how to do this
2. filter this “degrees” graph to just give you 0 degree vertices
3 use graph.mask passing in the 0-degree graph to get the original graph with 
just 0 degree vertices

Just one variation on several possibilities, the key point is that everything 
is just a graph transformation until you call an action on the resulting graph
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action




On 26 Feb 2016, at 11:59, Guillermo Ortiz 
> wrote:

I'm new to GraphX. I need to get the vertices without out edges.
I guess that it's pretty easy, but I did it in a pretty complicated and inefficient way.


val vertices: RDD[(VertexId, (List[String], List[String]))] =
  sc.parallelize(Array((1L, (List("a"), List[String]())),
(2L, (List("b"), List[String]())),
(3L, (List("c"), List[String]())),
(4L, (List("d"), List[String]())),
(5L, (List("e"), List[String]())),
(6L, (List("f"), List[String]()

// Create an RDD for edges
val relationships: RDD[Edge[Boolean]] =
  sc.parallelize(Array(Edge(1L, 2L, true), Edge(2L, 3L, true), Edge(3L, 4L, 
true), Edge(5L, 2L, true)))

val out = minGraph.outDegrees.map(vertex => vertex._1)

val finalVertexes = minGraph.vertices.keys.subtract(out)

//It must be something better than this way..
val nodes = finalVertexes.collect()
val result = minGraph.vertices.filter(v => nodes.contains(v._1))



What's the good way to do this operation? It seems that it should be pretty 
easy.




RE: Clarification on RDD

2016-02-26 Thread Mohammed Guller
HDFS, as the name implies, is a distributed file system. A file stored on HDFS 
is already distributed. So if you create an RDD from an HDFS file, the created 
RDD just points to the file partitions on different nodes.

You can read more about HDFS here.

http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

Mohammed
Author: Big Data Analytics with 
Spark

From: Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID]
Sent: Friday, February 26, 2016 9:41 AM
To: User
Subject: Clarification on RDD

Hi,

Spark doco says

Spark’s primary abstraction is a distributed collection of items called a 
Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop 
InputFormats (such as HDFS files) or by transforming other RDDs

example:

val textFile = sc.textFile("README.md")


My question is: when an RDD is created like above from a file stored on HDFS, does
that mean that the data is distributed among all the nodes in the cluster, or is the
data from the md file copied to each node of the cluster so that each node has a
complete copy of the data? Is the data actually moved around, or is it not copied
over until an action like count() is performed on the RDD?

Thanks



RE: Can we load csv partitioned data into one DF?

2016-02-22 Thread Mohammed Guller
Are all the csv files in the same directory?
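
If they are, you can point spark-csv at the whole directory and write everything out as Parquet in one pass. A rough sketch (Spark 1.x with the spark-csv package; sqlContext is an existing SQLContext, and the paths and options are made up):

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/yearly-csv/*.csv")   // all the files in the directory become one DataFrame

df.write.parquet("/data/yearly-parquet")

// Later, read the Parquet files back and treat them as a single DataFrame.
val allData = sqlContext.read.parquet("/data/yearly-parquet")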

Mohammed
Author: Big Data Analytics with 
Spark

From: saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com]
Sent: Monday, February 22, 2016 7:25 AM
To: user@spark.apache.org
Subject: Can we load csv partitioned data into one DF?

Hello all, I am facing a silly data question.

If I have +100 csv files which are part of the same data, but each csv is for 
example, a year on a timeframe column (i.e. partitioned by year),
what would you suggest instead of loading all those files and joining them?

Final target would be parquet. Is it possible, for example, to load them and 
then store them as parquet, and then read parquet and consider all as one?

Thanks for any suggestions,
Saif



RE: Check if column exists in Schema

2016-02-15 Thread Mohammed Guller
The DataFrame class has a method named columns, which returns all column names 
as an array. You can then use the contains method in the Scala Array class to 
check whether a column exists.
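
For example (a minimal sketch; df and the column names are made up):

import org.apache.spark.sql.DataFrame

def hasColumn(df: DataFrame, colName: String): Boolean =
  df.columns.contains(colName)

val value = if (hasColumn(df, "my_date")) df("my_date") else df("fallback_date")

Note that columns returns only top-level column names; for a nested field like some.path.to.my.date you would have to walk df.schema instead.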

Mohammed
Author: Big Data Analytics with 
Spark

From: Sebastian Piu [mailto:sebastian@gmail.com]
Sent: Monday, February 15, 2016 11:21 AM
To: user
Subject: Re: Check if column exists in Schema

I just realised this is a bit vague, I'm looking to create a function that 
looks into different columns to get a value. So depending on a type I might 
look into a given path or another (which might or might not exist).

Example: if the column some.path.to.my.date exists, I'd return that; if it doesn't
or it is null, I'd get it from some other place.

On Mon, Feb 15, 2016 at 7:17 PM Sebastian Piu 
> wrote:
Is there any way of checking if a given column exists in a Dataframe?


RE: [MLLIB] Best way to extract RandomForest decision splits

2016-02-10 Thread Mohammed Guller
Why not use the save method from the RandomForestModel class to save a model at 
a specified path?
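
For example (MLlib 1.x API; sc and model are assumed to already exist, and the path is made up):

import org.apache.spark.mllib.tree.model.RandomForestModel

model.save(sc, "/models/rf-model")
val sameModel = RandomForestModel.load(sc, "/models/rf-model")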



Mohammed

Author: Big Data Analytics with Spark





-Original Message-
From: jluan [mailto:jaylu...@gmail.com]
Sent: Wednesday, February 10, 2016 5:57 PM
To: user@spark.apache.org
Subject: [MLLIB] Best way to extract RandomForest decision splits



I've trained a RandomForest classifier where I can print my model's decisions
using model.toDebugString

However I was wondering if there's a way to extract tree programmatically by
traversing the nodes or in some other way such that I can write my own
decision file rather than just a debug string.

--

View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLLIB-Best-way-to-extract-RandomForest-decision-splits-tp26201.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.







RE: [MLLIB] Best way to extract RandomForest decision splits

2016-02-10 Thread Mohammed Guller
Yes, a model saved with the save method can be read back only by the load 
method in the RandomForestModel object.

Unfortunately, I don’t know any better mechanism for what you are trying to do. 
There was a discussion on this topic a few days ago, so if you search the 
mailing list archives, you may be able to find other alternatives.

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Jay Luan [mailto:jaylu...@gmail.com]
Sent: Wednesday, February 10, 2016 7:27 PM
To: Mohammed Guller
Cc: user@spark.apache.org
Subject: RE: [MLLIB] Best way to extract RandomForest decision splits


Thanks for the reply, I'd like to export the decision splits for each tree out 
to an external file which is read elsewhere not using spark. As far as I know, 
saving a model to a path will save a bunch of binary files which can be loaded 
back into spark. Is this correct?
On Feb 10, 2016 7:21 PM, "Mohammed Guller" 
<moham...@glassbeam.com<mailto:moham...@glassbeam.com>> wrote:

Why not use the save method from the RandomForestModel class to save a model at 
a specified path?



Mohammed

Author: Big Data Analytics with Spark





-Original Message-
From: jluan [mailto:jaylu...@gmail.com<mailto:jaylu...@gmail.com>]
Sent: Wednesday, February 10, 2016 5:57 PM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: [MLLIB] Best way to extract RandomForest decision splits



I've trained a RandomForest classifier where I can print my model's decisions

using model.toDebugString



However I was wondering if there's a way to extract tree programmatically by

traversing the nodes or in some other way such that I can write my own

decision file rather than just a debug string.







--

View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLLIB-Best-way-to-extract-RandomForest-decision-splits-tp26201.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.







RE: reducing disk space consumption

2016-02-10 Thread Mohammed Guller
If I remember it correctly, C* creates a snapshot when you drop a keyspace. Run 
the following command to get rid of the snapshot:
nodetool clearsnapshot

Mohammed
Author: Big Data Analytics with 
Spark

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, February 10, 2016 6:59 AM
To: user@cassandra.apache.org
Subject: reducing disk space consumption

Hi,
I am using DSE 4.8.4
On one node, disk space is low where:

42G /var/lib/cassandra/data/usertable/data-0abea7f0cf9211e5a355bf8dafbfa99c

Using CLI, I dropped keyspace usertable but the data dir above still consumes 
42G.

What action would free this part of disk (I don't need the data) ?

Thanks


RE: spark-cassandra-connector BulkOutputWriter

2016-02-09 Thread Mohammed Guller
Alex – I suggest posting this question on the Spark Cassandra Connector mailing 
list. The SCC developers are pretty responsive.

Mohammed
Author: Big Data Analytics with 
Spark

From: Alexandr Dzhagriev [mailto:dzh...@gmail.com]
Sent: Tuesday, February 9, 2016 6:52 AM
To: user
Subject: spark-cassandra-connector BulkOutputWriter

Hello all,

I looked through the cassandra spark integration 
(https://github.com/datastax/spark-cassandra-connector) and couldn't find any 
usages of the BulkOutputWriter (http://www.datastax.com/dev/blog/bulk-loading) 
- an awesome tool for creating local sstables, which could be later uploaded to 
a cassandra cluster.  Seems like (sorry if I'm wrong), it uses just bulk insert 
statements. So, my question is: does anybody know if there are any plans to 
support bulk loading?

Cheers, Alex.


RE: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread Mohammed Guller
You may have better luck with this question on the Spark Cassandra Connector 
mailing list.



One quick question about this code from your email:

   // Load DataFrame from C* data-source

   val base_data = base_data_df.getInstance(sqlContext)



What exactly is base_data_df and how are you creating it?

Mohammed
Author: Big Data Analytics with 
Spark



-Original Message-
From: bernh...@chapter7.ch [mailto:bernh...@chapter7.ch]
Sent: Tuesday, February 9, 2016 6:58 AM
To: user@spark.apache.org
Subject: [Spark Streaming] Joining Kafka and Cassandra DataFrames



All,



I'm new to Spark and I'm having a hard time doing a simple join of two DFs



Intent:

-  I'm receiving data from Kafka via direct stream and would like to enrich the 
messages with data from Cassandra. The Kafka messages

(Protobufs) are decoded into DataFrames and then joined with a (supposedly 
pre-filtered) DF from Cassandra. The relation of (Kafka) streaming batch size 
to raw C* data is [several streaming messages to millions of C* rows], BUT the 
join always yields exactly ONE result [1:1] per message. After the join the 
resulting DF is eventually stored to another C* table.



Problem:

- Even though I'm joining the two DFs on the full Cassandra primary key and 
pushing the corresponding filter to C*, it seems that Spark is loading the 
whole C* data-set into memory before actually joining (which I'd like to 
prevent by using the filter/predicate pushdown).

This leads to a lot of shuffling and tasks being spawned, hence the "simple" 
join takes forever...



Could anyone shed some light on this? In my perception this should be a 
prime-example for DFs and Spark Streaming.



Environment:

- Spark 1.6

- Cassandra 2.1.12

- Cassandra-Spark-Connector 1.5-RC1

- Kafka 0.8.2.2



Code:



def main(args: Array[String]) {
  val conf = new SparkConf()
    .setAppName("test")
    .set("spark.cassandra.connection.host", "xxx")
    .set("spark.cassandra.connection.keep_alive_ms", "3")
    .setMaster("local[*]")

  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.sparkContext.setLogLevel("INFO")

  // Initialise Kafka
  val kafkaTopics = Set[String]("xxx")
  val kafkaParams = Map[String, String](
    "metadata.broker.list" -> "xxx:32000,xxx:32000,xxx:32000,xxx:32000",
    "auto.offset.reset" -> "smallest")

  // Kafka stream
  val messages = KafkaUtils.createDirectStream[String, MyMsg, StringDecoder,
    MyMsgDecoder](ssc, kafkaParams, kafkaTopics)

  // Executed on the driver
  messages.foreachRDD { rdd =>

    // Create an instance of SQLContext
    val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
    import sqlContext.implicits._

    // Map MyMsg RDD
    val MyMsgRdd = rdd.map{case (key, MyMsg) => (MyMsg)}

    // Convert RDD[MyMsg] to DataFrame
    val MyMsgDf = MyMsgRdd.toDF()
      .select(
        $"prim1Id" as 'prim1_id,
        $"prim2Id" as 'prim2_id,
        $...
      )

    // Load DataFrame from C* data-source
    val base_data = base_data_df.getInstance(sqlContext)

    // Inner join on prim1Id and prim2Id
    val joinedDf = MyMsgDf.join(base_data,
        MyMsgDf("prim1_id") === base_data("prim1_id") &&
        MyMsgDf("prim2_id") === base_data("prim2_id"), "left")
      .filter(base_data("prim1_id").isin(MyMsgDf("prim1_id"))
        && base_data("prim2_id").isin(MyMsgDf("prim2_id")))

    joinedDf.show()
    joinedDf.printSchema()

    // Select relevant fields

    // Persist

  }

  // Start the computation
  ssc.start()
  ssc.awaitTermination()
}



SO:

http://stackoverflow.com/questions/35295182/joining-kafka-and-cassandra-dataframes-in-spark-streaming-ignores-c-predicate-p











RE: How to collect/take arbitrary number of records in the driver?

2016-02-09 Thread Mohammed Guller
You can do something like this:



val indexedRDD = rdd.zipWithIndex
val filteredRDD = indexedRDD.filter{case(element, index) => (index >= 99) && (index < 199)}
val result = filteredRDD.take(100)



Warning: the ordering of the elements in the RDD is not guaranteed.

Mohammed
Author: Big Data Analytics with 
Spark



-Original Message-
From: SRK [mailto:swethakasire...@gmail.com]
Sent: Tuesday, February 9, 2016 1:58 PM
To: user@spark.apache.org
Subject: How to collect/take arbitrary number of records in the driver?



Hi ,



How to get a fixed amount of records from an RDD in Driver? Suppose I want the 
records from 100 to 1000 and then save them to some external database, I know 
that I can do it from Workers in partition but I want to avoid that for some 
reasons. The idea is to collect the data to driver and save, although slowly.



I am looking for something like take(100, 1000)  or take (1000,2000)



Thanks,

Swetha







--

View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-collect-take-arbitrary-number-of-records-in-the-driver-tp26184.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.







RE: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread Mohammed Guller
Hi Bernhard,

Take a look at the examples shown under the "Pushing down clauses to Cassandra" 
sections on this page:

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md


Mohammed
Author: Big Data Analytics with Spark

-Original Message-
From: bernh...@chapter7.ch [mailto:bernh...@chapter7.ch] 
Sent: Tuesday, February 9, 2016 10:05 PM
To: Mohammed Guller
Cc: user@spark.apache.org
Subject: Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames

Hi Mohammed

Thanks for hint, I should probably do that :)

As for the DF singleton:

/**
  * Lazily instantiated singleton instance of base_data DataFrame
  */
object base_data_df {

   @transient private var instance: DataFrame = _

   def getInstance(sqlContext: SQLContext): DataFrame = {
 if (instance == null) {
   // Load DataFrame with C* data-source
   instance = sqlContext.read
 .format("org.apache.spark.sql.cassandra")
 .options(Map("table" -> "cf", "keyspace" -> "ks"))
     .load()
 }
 instance
   }
}

Bernhard

Quoting Mohammed Guller <moham...@glassbeam.com>:

> You may have better luck with this question on the Spark Cassandra 
> Connector mailing list.
>
>
>
> One quick question about this code from your email:
>
>// Load DataFrame from C* data-source
>
>val base_data = base_data_df.getInstance(sqlContext)
>
>
>
> What exactly is base_data_df and how are you creating it?
>
> Mohammed
> Author: Big Data Analytics with
> Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/
> 1484209656/>
>
>
>
> -Original Message-
> From: bernh...@chapter7.ch [mailto:bernh...@chapter7.ch]
> Sent: Tuesday, February 9, 2016 6:58 AM
> To: user@spark.apache.org
> Subject: [Spark Streaming] Joining Kafka and Cassandra DataFrames
>
>
>
> All,
>
>
>
> I'm new to Spark and I'm having a hard time doing a simple join of two 
> DFs
>
>
>
> Intent:
>
> -  I'm receiving data from Kafka via direct stream and would like to 
> enrich the messages with data from Cassandra. The Kafka messages
>
> (Protobufs) are decoded into DataFrames and then joined with a 
> (supposedly pre-filtered) DF from Cassandra. The relation of (Kafka) 
> streaming batch size to raw C* data is [several streaming messages to 
> millions of C* rows], BUT the join always yields exactly ONE result 
> [1:1] per message. After the join the resulting DF is eventually 
> stored to another C* table.
>
>
>
> Problem:
>
> - Even though I'm joining the two DFs on the full Cassandra primary 
> key and pushing the corresponding filter to C*, it seems that Spark is 
> loading the whole C* data-set into memory before actually joining 
> (which I'd like to prevent by using the filter/predicate pushdown).
>
> This leads to a lot of shuffling and tasks being spawned, hence the 
> "simple" join takes forever...
>
>
>
> Could anyone shed some light on this? In my perception this should be 
> a prime-example for DFs and Spark Streaming.
>
>
>
> Environment:
>
> - Spark 1.6
>
> - Cassandra 2.1.12
>
> - Cassandra-Spark-Connector 1.5-RC1
>
> - Kafka 0.8.2.2
>
>
>
> Code:
>
>
>
> def main(args: Array[String]) {
>
>  val conf = new SparkConf()
>
>.setAppName("test")
>
>.set("spark.cassandra.connection.host", "xxx")
>
>.set("spark.cassandra.connection.keep_alive_ms", "3")
>
>.setMaster("local[*]")
>
>
>
>  val ssc = new StreamingContext(conf, Seconds(10))
>
>  ssc.sparkContext.setLogLevel("INFO")
>
>
>
>  // Initialise Kafka
>
>  val kafkaTopics = Set[String]("xxx")
>
>  val kafkaParams = Map[String, String](
>
>"metadata.broker.list" -> 
> "xxx:32000,xxx:32000,xxx:32000,xxx:32000",
>
>"auto.offset.reset" -> "smallest")
>
>
>
>  // Kafka stream
>
>  val messages = KafkaUtils.createDirectStream[String, MyMsg, 
> StringDecoder, MyMsgDecoder](ssc, kafkaParams, kafkaTopics)
>
>
>
>  // Executed on the driver
>
>  messages.foreachRDD { rdd =>
>
>
>
>// Create an instance of SQLContext
>
>val sqlContext = 
> SQLContextSingleton.getInstance(rdd.sparkContext)
>
>import sqlContext.implicits._
>
>
>
>// Map MyMsg RDD
>
>val MyMsgRdd = rdd.map{case (key, MyMsg) => (MyMsg)}
>
>
>
>// Convert RDD[MyMsg] to DataFrame
>
>  

RE: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread Mohammed Guller
Moving the spark mailing list to BCC since this is not really related to Spark.

Maybe I am missing something, but where are you calling the filter method on 
the base_data DF to push down the predicates to Cassandra before calling the 
join method? 
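
For example, a filter with literal values on all of the partition-key columns, applied to base_data before the join, is the kind of predicate the connector can push down (a rough sketch; the column names follow your earlier code and the values are made up):

val base_filtered = base_data.filter(
  base_data("prim1_id") === 1L && base_data("prim2_id") === 2L)

val joinedDf = MyMsgDf.join(base_filtered,
  MyMsgDf("prim1_id") === base_filtered("prim1_id") &&
  MyMsgDf("prim2_id") === base_filtered("prim2_id"))

A predicate built from another DataFrame's columns, as in your isin calls, is not a literal filter, so there is nothing for the connector to push down.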

Mohammed
Author: Big Data Analytics with Spark


-Original Message-
From: bernh...@chapter7.ch [mailto:bernh...@chapter7.ch] 
Sent: Tuesday, February 9, 2016 10:47 PM
To: Mohammed Guller
Cc: user@spark.apache.org
Subject: Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames

Hi Mohammed

I'm aware of that documentation, what are you hinting at specifically?  
I'm pushing all elements of the partition key, so that should work. As user 
zero323 on SO pointed out, the problem is most probably related to the 
dynamic nature of the predicate elements (two distributed collections per 
filter per join).

The statement "To push down partition keys, all of them must be included, but 
not more than one predicate per partition key, otherwise nothing is pushed 
down."

Does not apply IMO?

Bernhard

Quoting Mohammed Guller <moham...@glassbeam.com>:

> Hi Bernhard,
>
> Take a look at the examples shown under the "Pushing down clauses to 
> Cassandra" sections on this page:
>
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/
> 14_data_frames.md
>
>
> Mohammed
> Author: Big Data Analytics with Spark
>
> -Original Message-
> From: bernh...@chapter7.ch [mailto:bernh...@chapter7.ch]
> Sent: Tuesday, February 9, 2016 10:05 PM
> To: Mohammed Guller
> Cc: user@spark.apache.org
> Subject: Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames
>
> Hi Mohammed
>
> Thanks for hint, I should probably do that :)
>
> As for the DF singleton:
>
> /**
>   * Lazily instantiated singleton instance of base_data DataFrame
>   */
> object base_data_df {
>
>@transient private var instance: DataFrame = _
>
>def getInstance(sqlContext: SQLContext): DataFrame = {
>  if (instance == null) {
>// Load DataFrame with C* data-source
>instance = sqlContext.read
>  .format("org.apache.spark.sql.cassandra")
>  .options(Map("table" -> "cf", "keyspace" -> "ks"))
>  .load()
>  }
>  instance
>}
> }
>
> Bernhard
>
> Quoting Mohammed Guller <moham...@glassbeam.com>:
>
>> You may have better luck with this question on the Spark Cassandra 
>> Connector mailing list.
>>
>>
>>
>> One quick question about this code from your email:
>>
>>// Load DataFrame from C* data-source
>>
>>val base_data = base_data_df.getInstance(sqlContext)
>>
>>
>>
>> What exactly is base_data_df and how are you creating it?
>>
>> Mohammed
>> Author: Big Data Analytics with
>> Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp
>> /
>> 1484209656/>
>>
>>
>>
>> -Original Message-
>> From: bernh...@chapter7.ch [mailto:bernh...@chapter7.ch]
>> Sent: Tuesday, February 9, 2016 6:58 AM
>> To: user@spark.apache.org
>> Subject: [Spark Streaming] Joining Kafka and Cassandra DataFrames
>>
>>
>>
>> All,
>>
>>
>>
>> I'm new to Spark and I'm having a hard time doing a simple join of 
>> two DFs
>>
>>
>>
>> Intent:
>>
>> -  I'm receiving data from Kafka via direct stream and would like to 
>> enrich the messages with data from Cassandra. The Kafka messages
>>
>> (Protobufs) are decoded into DataFrames and then joined with a 
>> (supposedly pre-filtered) DF from Cassandra. The relation of (Kafka) 
>> streaming batch size to raw C* data is [several streaming messages to 
>> millions of C* rows], BUT the join always yields exactly ONE result 
>> [1:1] per message. After the join the resulting DF is eventually 
>> stored to another C* table.
>>
>>
>>
>> Problem:
>>
>> - Even though I'm joining the two DFs on the full Cassandra primary 
>> key and pushing the corresponding filter to C*, it seems that Spark 
>> is loading the whole C* data-set into memory before actually joining 
>> (which I'd like to prevent by using the filter/predicate pushdown).
>>
>> This leads to a lot of shuffling and tasks being spawned, hence the 
>> "simple" join takes forever...
>>
>>
>>
>> Could anyone shed some light on this? In my perception this should be 
>> a prime-example for DFs and Spark Streaming.
>>
>>
>>
>> Environment:
>>
>> - Spa

RE: submit spark job with spcified file for driver

2016-02-04 Thread Mohammed Guller
Here is the description of the --files option that you can specify to
spark-submit:

--files FILES   Comma-separated list of files to be placed in the 
working directory of each executor.
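
For example (names and paths are made up), ship the file with

spark-submit --files /local/path/app.properties --class com.example.Main my-app.jar

and then, inside a task running on an executor, read it by its bare name, since it sits in that executor's working directory:

import java.io.FileInputStream
import java.util.Properties

val props = new Properties()
props.load(new FileInputStream("app.properties"))
val someValue = props.getProperty("some.key")

In client mode the driver can simply read the file from its original local path.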


Mohammed
Author: Big Data Analytics with Spark


-Original Message-
From: alexeyy3 [mailto:alexey.yakubov...@searshc.com] 
Sent: Thursday, February 4, 2016 2:18 PM
To: user@spark.apache.org
Subject: submit spark job with spcified file for driver

Is it possible to specify a file (with key-value properties) when submitting a
Spark app with spark-submit? Some mails refer to the key --file, but the docs
do not mention it.
If you can specify a file, how do you access it from the Spark job driver?

Thank you, Alexey



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/submit-spark-job-with-spcified-file-for-driver-tp26153.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




RE: add new column in the schema + Dataframe

2016-02-04 Thread Mohammed Guller
Hi Divya,
You can use the withColumn method from the DataFrame API. Here is the method 
signature:

def withColumn(colName: String, col: Column): DataFrame
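
For example (a minimal sketch; df, the column names, and the condition are made up):

import org.apache.spark.sql.functions._

val withFlag = df.withColumn(
  "new_flag",
  when(col("some_col") === "some_value", "Y").otherwise("N"))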


Mohammed
Author: Big Data Analytics with 
Spark

From: Divya Gehlot [mailto:divya.htco...@gmail.com]
Sent: Thursday, February 4, 2016 1:29 AM
To: user @spark
Subject: add new column in the schema + Dataframe

Hi,
I am beginner in spark and using Spark 1.5.2 on YARN.(HDP2.3.4)
I have a use case where I have to read two input files and, based on certain
conditions in the second input file, add a new column to the first input
file and save it.

I am using spark-csv to read my input files .
Would really appreciate if somebody would share their thoughts on best/feasible 
way of doing it(using dataframe API)


Thanks,
Divya




RE: spark-cassandra

2016-02-03 Thread Mohammed Guller
Another thing to check is which version of the Spark-Cassandra-Connector the
Spark Job server is passing to the workers. It looks like when you use
spark-submit, you are sending the correct SCC jar, but the Spark Job server may
be using a different one.

Mohammed
Author: Big Data Analytics with 
Spark

From: Gerard Maas [mailto:gerard.m...@gmail.com]
Sent: Wednesday, February 3, 2016 4:56 AM
To: Madabhattula Rajesh Kumar
Cc: user@spark.apache.org
Subject: Re: spark-cassandra

NoSuchMethodError usually refers to a version conflict. Probably your job was 
built against a higher version of the cassandra connector than what's available 
on the run time.
Check that the versions are aligned.

-kr, Gerard.

On Wed, Feb 3, 2016 at 1:37 PM, Madabhattula Rajesh Kumar 
> wrote:
Hi,
I am using Spark Jobserver to submit the jobs. I am using the spark-cassandra
connector to connect to Cassandra. I am getting the below exception through the
Spark Jobserver.
If I submit the job through the spark-submit command, it works fine.
Please let me know how to solve this issue.


Exception in thread "pool-1-thread-1" java.lang.NoSuchMethodError: 
com.datastax.driver.core.TableMetadata.getIndexes()Ljava/util/List;
at com.datastax.spark.connector.cql.Schema$.getIndexMap(Schema.scala:193)
at 
com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchPartitionKey(Schema.scala:197)
at 
com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchTables$1$2.apply(Schema.scala:239)
at 
com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchTables$1$2.apply(Schema.scala:238)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:153)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
at 
scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at 
com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchTables$1(Schema.scala:238)
at 
com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:247)
at 
com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:246)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:153)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
at 
scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at 
com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1(Schema.scala:246)
at 
com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:252)
at 
com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:249)
at 
com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withClusterDo$1.apply(CassandraConnector.scala:121)
at 
com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withClusterDo$1.apply(CassandraConnector.scala:120)
at 
com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at 
com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109)
at 
com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139)
at 
com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
at 
com.datastax.spark.connector.cql.CassandraConnector.withClusterDo(CassandraConnector.scala:120)
at com.datastax.spark.connector.cql.Schema$.fromCassandra(Schema.scala:249)
at 
com.datastax.spark.connector.writer.TableWriter$.apply(TableWriter.scala:263)
at 
com.datastax.spark.connector.RDDFunctions.saveToCassandra(RDDFunctions.scala:36)
at 
com.cisco.ss.etl.utils.ETLHelper$class.persistBackupConfigDevicesData(ETLHelper.scala:79)
at com.cisco.ss.etl.Main$.persistBackupConfigDevicesData(Main.scala:13)
at 
com.cisco.ss.etl.utils.ETLHelper$class.persistByBacthes(ETLHelper.scala:43)
at com.cisco.ss.etl.Main$.persistByBacthes(Main.scala:13)
at com.cisco.ss.etl.Main$$anonfun$runJob$3.apply(Main.scala:48)
at com.cisco.ss.etl.Main$$anonfun$runJob$3.apply(Main.scala:45)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at com.cisco.ss.etl.Main$.runJob(Main.scala:45)
at com.cisco.ss.etl.Main$.runJob(Main.scala:13)
at 

RE: Spark 1.5.2 memory error

2016-02-03 Thread Mohammed Guller
Nirav,
Sorry to hear about your experience with Spark; however, sucks is a very strong 
word. Many organizations are processing a lot more than 150GB of data  with 
Spark.

Mohammed
Author: Big Data Analytics with 
Spark

From: Nirav Patel [mailto:npa...@xactlycorp.com]
Sent: Wednesday, February 3, 2016 11:31 AM
To: Stefan Panayotov
Cc: Jim Green; Ted Yu; Jakob Odersky; user@spark.apache.org
Subject: Re: Spark 1.5.2 memory error

Hi Stefan,

Welcome to the OOM - heap space club. I have been struggling with similar
errors (OOM and the YARN executor being killed), failing jobs, or jobs being sent
into retry loops. I bet the same job will run perfectly fine with fewer resources
as a Hadoop MapReduce program. I have tested it for my program and it does work.

Bottom line from my experience: Spark sucks with memory management when a job is
processing a large (not huge) amount of data. It's failing for me with 16gb
executors, 10 executors, 6 threads each. And the data it's processing is only 150GB!
It's 1 billion rows for me. The same job works perfectly fine with 1 million rows.

Hope that saves you some trouble.

Nirav



On Wed, Feb 3, 2016 at 11:00 AM, Stefan Panayotov 
> wrote:
I drastically increased the memory:

spark.executor.memory = 50g
spark.driver.memory = 8g
spark.driver.maxResultSize = 8g
spark.yarn.executor.memoryOverhead = 768

I still see executors killed, but this time the memory does not seem to be the 
issue.
The error on the Jupyter notebook is:



Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.collectAndServe.

: org.apache.spark.SparkException: Job aborted due to stage failure: Exception 
while getting task result: java.io.IOException: Failed to connect to 
/10.0.0.9:48755

From nodemanagers log corresponding to worker 10.0.0.9:

2016-02-03 17:31:44,917 INFO  yarn.YarnShuffleService 
(YarnShuffleService.java:initializeApplication(129)) - Initializing application 
application_1454509557526_0014

2016-02-03 17:31:44,918 INFO  container.ContainerImpl 
(ContainerImpl.java:handle(1131)) - Container 
container_1454509557526_0014_01_93 transitioned from LOCALIZING to LOCALIZED

2016-02-03 17:31:44,947 INFO  container.ContainerImpl 
(ContainerImpl.java:handle(1131)) - Container 
container_1454509557526_0014_01_93 transitioned from LOCALIZED to RUNNING

2016-02-03 17:31:44,951 INFO  nodemanager.DefaultContainerExecutor 
(DefaultContainerExecutor.java:buildCommandExecutor(267)) - launchContainer: 
[bash, 
/mnt/resource/hadoop/yarn/local/usercache/root/appcache/application_1454509557526_0014/container_1454509557526_0014_01_93/default_container_executor.sh]

2016-02-03 17:31:45,686 INFO  monitor.ContainersMonitorImpl 
(ContainersMonitorImpl.java:run(371)) - Starting resource-monitoring for 
container_1454509557526_0014_01_93

2016-02-03 17:31:45,686 INFO  monitor.ContainersMonitorImpl 
(ContainersMonitorImpl.java:run(385)) - Stopping resource-monitoring for 
container_1454509557526_0014_01_11



Then I can see the memory usage increasing from 230.6 MB to 12.6 GB, which is 
far below 50g, and then suddenly getting killed!?!



2016-02-03 17:33:17,350 INFO  monitor.ContainersMonitorImpl 
(ContainersMonitorImpl.java:run(458)) - Memory usage of ProcessTree 30962 for 
container-id container_1454509557526_0014_01_93: 12.6 GB of 51 GB physical 
memory used; 52.8 GB of 107.1 GB virtual memory used

2016-02-03 17:33:17,613 INFO  container.ContainerImpl 
(ContainerImpl.java:handle(1131)) - Container 
container_1454509557526_0014_01_93 transitioned from RUNNING to KILLING

2016-02-03 17:33:17,613 INFO  launcher.ContainerLaunch 
(ContainerLaunch.java:cleanupContainer(370)) - Cleaning up container 
container_1454509557526_0014_01_93

2016-02-03 17:33:17,629 WARN  nodemanager.DefaultContainerExecutor 
(DefaultContainerExecutor.java:launchContainer(223)) - Exit code from container 
container_1454509557526_0014_01_93 is : 143

2016-02-03 17:33:17,667 INFO  container.ContainerImpl 
(ContainerImpl.java:handle(1131)) - Container 
container_1454509557526_0014_01_93 transitioned from KILLING to 
CONTAINER_CLEANEDUP_AFTER_KILL

2016-02-03 17:33:17,669 INFO  nodemanager.NMAuditLogger 
(NMAuditLogger.java:logSuccess(89)) - USER=root   OPERATION=Container 
Finished - KilledTARGET=ContainerImpl RESULT=SUCCESS   
APPID=application_1454509557526_0014 
CONTAINERID=container_1454509557526_0014_01_93

2016-02-03 17:33:17,670 INFO  container.ContainerImpl 
(ContainerImpl.java:handle(1131)) - Container 
container_1454509557526_0014_01_93 transitioned from 
CONTAINER_CLEANEDUP_AFTER_KILL to DONE

2016-02-03 17:33:17,670 INFO  application.ApplicationImpl 
(ApplicationImpl.java:transition(347)) - Removing 
container_1454509557526_0014_01_93 from application 
application_1454509557526_0014


RE: Cassandra BEGIN BATCH

2016-02-03 Thread Mohammed Guller
Frank,
I don’t think so. Cassandra does not support transactions in the traditional 
sense. It is not an ACID compliant database.

Mohammed
Author: Big Data Analytics with 
Spark

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, February 3, 2016 2:55 PM
To: FrankFlaherty
Cc: user
Subject: Re: Cassandra BEGIN BATCH

Seems you can find faster response on Cassandra Connector mailing list.

On Wed, Feb 3, 2016 at 1:45 PM, FrankFlaherty 
> wrote:
Cassandra provides "BEGIN BATCH" and "APPLY BATCH" to perform atomic
execution of multiple statements as below:

BEGIN BATCH
  INSERT INTO "user_status_updates"
("username", "id", "body")
  VALUES(
'dave',
16e2f240-2afa-11e4-8069-5f98e903bf02,
'dave update 4'
);

  INSERT INTO "home_status_updates" (
"timeline_username",
"status_update_id",
"status_update_username",
"body")
  VALUES (
'alice',
16e2f240-2afa-11e4-8069-5f98e903bf02,
'dave',
'dave update 4'
  );
APPLY BATCH;

Is there a way to update two or more Cassandra tables atomically using the
Cassandra Connector from Spark?

Thanks,
Frank





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-BEGIN-BATCH-tp26145.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




RE: how to introduce spark to your colleague if he has no background about *** spark related

2016-02-02 Thread Mohammed Guller
Hi Charles,

You may find slides 16-20 from this deck useful:
http://www.slideshare.net/mg007/big-data-trends-challenges-opportunities-57744483

I used it for a talk that I gave to MS students last week. I wanted to give 
them some context before describing Spark.

It doesn’t cover all the stuff that you have on your agenda, but I would be 
happy to guide you. Feel free to send me a direct email.

Mohammed
Author: Big Data Analytics with 
Spark

From: Xiao Li [mailto:gatorsm...@gmail.com]
Sent: Sunday, January 31, 2016 10:41 PM
To: Jörn Franke
Cc: charles li; user
Subject: Re: how to introduce spark to your colleague if he has no background 
about *** spark related

My 2 cents. Concepts are always boring to the people with zero background. Use 
examples to show how easy and powerful Spark is! Use cases are also useful for 
them. Downloaded the slides in Spark summit. I believe you can find a lot of 
interesting ideas!

Tomorrow, I am facing similar issues, but the audiences are three RDBMS engine 
experts. I will go over the paper Spark SQL in Sigmod 2015 with them and show 
them the source codes.

Good luck!

Xiao Li

2016-01-31 22:35 GMT-08:00 Jörn Franke 
>:
It depends of course on the background of the people, but how about some
examples ("word count") of how it works in the background?

On 01 Feb 2016, at 07:31, charles li 
> wrote:

Apache Spark™ is a fast and general engine for large-scale data processing.

it's a good profile of spark, but it's really too short for lots of people if 
they have little background in this field.

ok, frankly, I'll give a tech-talk about spark later this week, and now I'm 
writing a slide about that, but I'm stuck at the first slide.


I'm going to talk about a few questions about Spark in the first part of my
talk, since most of my colleagues have no background in Spark or Hadoop, so I want
to cover:

1. the background of birth of spark
2. pros and cons of spark, or the situations that spark is going to handle, or 
why we use spark
3. the basic principles of spark,
4. the basic conceptions of spark

Has anyone met this kind of problem, introducing Spark to someone who has no
background in your field? I hope you can tell me how you handled this
problem at the time, or give some ideas about the 4 sections mentioned above.


great thanks.


--
--
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao



RE: saveAsTextFile is not writing to local fs

2016-02-01 Thread Mohammed Guller
If the data is not too big, one option is to call the collect method and then 
save the result to a local file using standard Java/Scala API. However, keep in 
mind that this will transfer data from all the worker nodes to the driver 
program. Looks like that is what you want to do anyway, but you need to be 
aware of how big that data is and related implications.

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Siva [mailto:sbhavan...@gmail.com]
Sent: Monday, February 1, 2016 6:00 PM
To: Mohammed Guller
Cc: spark users
Subject: Re: saveAsTextFile is not writing to local fs

Hi Mohamed,

Thanks for your response. Data is available in worker nodes. But looking for 
something to write directly to local fs. Seems like it is not an option.

Thanks,
Sivakumar Bhavanari.

On Mon, Feb 1, 2016 at 5:45 PM, Mohammed Guller 
<moham...@glassbeam.com<mailto:moham...@glassbeam.com>> wrote:
You should not be saving an RDD to local FS if Spark is running on a real 
cluster. Essentially, each Spark worker will save the partitions that it 
processes locally.

Check the directories on the worker nodes and you should find pieces of your 
file on each node.

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Siva [mailto:sbhavan...@gmail.com<mailto:sbhavan...@gmail.com>]
Sent: Friday, January 29, 2016 5:40 PM
To: Mohammed Guller
Cc: spark users
Subject: Re: saveAsTextFile is not writing to local fs

Hi Mohammed,

Thanks for your quick response. I m submitting spark job to Yarn in 
"yarn-client" mode on a 6 node cluster. I ran the job by turning on DEBUG mode. 
I see the below exception, but this exception occurred after saveAsTextfile 
function is finished.

16/01/29 20:26:57 DEBUG HttpParser:
java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:190)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at 
org.spark-project.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
at 
org.spark-project.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
at 
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
at org.spark-project.jetty.http.HttpParser.fill(HttpParser.java:1044)
at 
org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:280)
at 
org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at 
org.spark-project.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at 
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
16/01/29 20:26:57 DEBUG HttpParser:
java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:190)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at 
org.spark-project.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
at 
org.spark-project.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
at 
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
at org.spark-project.jetty.http.HttpParser.fill(HttpParser.java:1044)
at 
org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:280)
at 
org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at 
org.spark-project.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at 
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
16/01/29 20:26:57 DEBUG HttpParser: HttpParser{s=-14,l=0,c=-3}
org.spark-project.jetty.io.EofException

Do you think this is what is causing the issue?

Thanks,
Sivakumar Bhavanari.

On Fri, Jan 29, 2016 at 3:55 PM, Mohammed Guller 
<moham...@glassbeam.com<mailto:moham...@glassbeam.com>> wrote:
Is it a multi-node cluster, or are you running Spark on a single machine?

You can change Spark’s logging level to INFO or DEBUG to see what is going on.

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Siva [mailto:sbhavan...@gmail.com<mailto:sbhavan...@gmail.com>]
Sent: Friday, January 29, 201

RE: saveAsTextFile is not writing to local fs

2016-02-01 Thread Mohammed Guller
You should not be saving an RDD to local FS if Spark is running on a real 
cluster. Essentially, each Spark worker will save the partitions that it 
processes locally.

Check the directories on the worker nodes and you should find pieces of your 
file on each node.

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Siva [mailto:sbhavan...@gmail.com]
Sent: Friday, January 29, 2016 5:40 PM
To: Mohammed Guller
Cc: spark users
Subject: Re: saveAsTextFile is not writing to local fs

Hi Mohammed,

Thanks for your quick response. I am submitting the Spark job to YARN in 
"yarn-client" mode on a 6-node cluster. I ran the job with DEBUG logging turned on. 
I see the exception below, but it occurred after the saveAsTextFile call finished.

16/01/29 20:26:57 DEBUG HttpParser:
java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:190)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at 
org.spark-project.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
at 
org.spark-project.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
at 
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
at org.spark-project.jetty.http.HttpParser.fill(HttpParser.java:1044)
at 
org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:280)
at 
org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at 
org.spark-project.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at 
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
16/01/29 20:26:57 DEBUG HttpParser:
java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:190)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at 
org.spark-project.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
at 
org.spark-project.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
at 
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
at org.spark-project.jetty.http.HttpParser.fill(HttpParser.java:1044)
at 
org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:280)
at 
org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at 
org.spark-project.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at 
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
16/01/29 20:26:57 DEBUG HttpParser: HttpParser{s=-14,l=0,c=-3}
org.spark-project.jetty.io.EofException

Do you think this is what is causing the issue?

Thanks,
Sivakumar Bhavanari.

On Fri, Jan 29, 2016 at 3:55 PM, Mohammed Guller 
<moham...@glassbeam.com<mailto:moham...@glassbeam.com>> wrote:
Is it a multi-node cluster, or are you running Spark on a single machine?

You can change Spark’s logging level to INFO or DEBUG to see what is going on.

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Siva [mailto:sbhavan...@gmail.com<mailto:sbhavan...@gmail.com>]
Sent: Friday, January 29, 2016 3:38 PM
To: spark users
Subject: saveAsTextFile is not writing to local fs

Hi Everyone,

We are using Spark 1.4.1 and we have a requirement to write data to the local fs 
instead of HDFS.

When trying to save an RDD to the local fs with saveAsTextFile, it just writes a 
_SUCCESS file in the folder with no part- files, and there are no error or warning 
messages on the console.

Is there any place I should look to fix this problem?

Thanks,
Sivakumar Bhavanari.



RE: saveAsTextFile is not writing to local fs

2016-01-29 Thread Mohammed Guller
Is it a multi-node cluster, or are you running Spark on a single machine?

You can change Spark’s logging level to INFO or DEBUG to see what is going on.

Mohammed
Author: Big Data Analytics with 
Spark

From: Siva [mailto:sbhavan...@gmail.com]
Sent: Friday, January 29, 2016 3:38 PM
To: spark users
Subject: saveAsTextFile is not writing to local fs

Hi Everyone,

We are using Spark 1.4.1 and we have a requirement to write data to the local fs 
instead of HDFS.

When trying to save an RDD to the local fs with saveAsTextFile, it just writes a 
_SUCCESS file in the folder with no part- files, and there are no error or warning 
messages on the console.

Is there any place I should look to fix this problem?

Thanks,
Sivakumar Bhavanari.


RE: JSON to SQL

2016-01-28 Thread Mohammed Guller
You don’t need Hive for that. The DataFrame class has a method  named explode, 
which provides the same functionality.

Here is an example from the Spark API documentation:
df.explode("words", "word"){words: String => words.split(" ")}

The first argument to the explode method  is the name of the input column and 
the second argument is the name of the output column.
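
For example, a minimal sketch (the column names and data are made up for illustration; it assumes import sqlContext.implicits._):

// a small DataFrame with one "words" column, then one row per word after explode
val df = Seq(Tuple1("hello world"), Tuple1("json to sql")).toDF("words")
val exploded = df.explode("words", "word") { words: String => words.split(" ") }
exploded.show()  // the original "words" column is kept alongside the new "word" column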

Mohammed
Author: Big Data Analytics with 
Spark

From: Andrés Ivaldi [mailto:iaiva...@gmail.com]
Sent: Wednesday, January 27, 2016 7:17 PM
To: Cheng, Hao
Cc: Sahil Sareen; Al Pivonka; user
Subject: Re: JSON to SQL

I'm using DataFrames to read the JSON exactly as you say, and I can get the 
schema from there. Reading the documentation, I realized that it is possible to 
create a structure dynamically, so by applying some transformations to the 
DataFrame plus the new structure I'll be able to save the JSON to my RDBMS.

For the flatten approach, you mentioned LateralView. Do I need a Hive DB for 
that, or just the Spark HiveContext? I saw some examples and that is exactly 
what I need. Can you explain it a little bit more?

Thanks

On Wed, Jan 27, 2016 at 10:29 PM, Cheng, Hao 
> wrote:
Have you ever tried the DataFrame API, e.g. 
sqlContext.read.json("/path/to/file.json")? Spark SQL will auto-infer the 
type/schema for you.

And lateral view will help with the flattening issues 
(https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView), as 
well as the "a.b[0].c" style of expression.


From: Andrés Ivaldi [mailto:iaiva...@gmail.com]
Sent: Thursday, January 28, 2016 3:39 AM
To: Sahil Sareen
Cc: Al Pivonka; user
Subject: Re: JSON to SQL

I'm really brand new to Scala, but if I'm defining a case class, isn't that 
because I already know the JSON's structure in advance?

If I were able to define a case class dynamically from the JSON structure, then 
even with Spark I would be able to extract the data.

On Wed, Jan 27, 2016 at 4:01 PM, Sahil Sareen 
> wrote:
Isn't this just about defining a case class and using 
parse(json).extract[CaseClassName]  using Jackson?

-Sahil

On Wed, Jan 27, 2016 at 11:08 PM, Andrés Ivaldi 
> wrote:
We don't have domain objects; it's a service like a pipeline: data is read from 
the source and saved in a relational database.

I can read the structure from DataFrames and do some transformations; I would 
prefer to do it with Spark to be consistent with the rest of the process.

On Wed, Jan 27, 2016 at 12:25 PM, Al Pivonka 
> wrote:
Are you using a relational database?
If so, why not use a NoSQL DB and then pull from it into your relational one?

Or utilize a library that understands JSON structure, like Jackson, to obtain the 
data from the JSON and then persist the domain objects?

On Wed, Jan 27, 2016 at 9:45 AM, Andrés Ivaldi 
> wrote:
Sure,
The job is like an ETL, but without an interface, so I decide the rules for how 
the JSON will be saved into a SQL table.

I need to flatten the hierarchies where possible (in the case of lists, flatten 
them as well); nested objects won't be processed for now.

{"a":1,"b":[2,3],"c":"Field", "d":[4,5,6,7,8] }
{"a":11,"b":[22,33],"c":"Field1", "d":[44,55,66,77,88] }
{"a":111,"b":[222,333],"c":"Field2", "d":[44,55,666,777,888] }

I would like something like this in my SQL table:

a    b        c       d
1    2,3      Field   4,5,6,7,8
11   22,33    Field1  44,55,66,77,88
111  222,333  Field2  444,555,666,777,888

Right now this is what I need.
I will later add more intelligence, like detection of lists or nested objects, 
and create relations in other tables.



On Wed, Jan 27, 2016 at 11:25 AM, Al Pivonka 
> wrote:
More detail is needed.
Can you provide some context to the use-case ?

On Wed, Jan 27, 2016 at 8:33 AM, Andrés Ivaldi 
> wrote:
Hello, I'm trying to save a JSON file into a SQL table.

If I try to do this directly, an IllegalArgumentException is raised; I suppose 
this is because JSON has a hierarchical structure, is that correct?

If that is the problem, how can I flatten the JSON structure? The JSON structure 
to be processed is unknown in advance, so I need to do it programmatically.

regards
--
Ing. Ivaldi Andres



--
Those who say it can't be done, are usually interrupted by those doing it.


--
Ing. Ivaldi Andres



--
Those who say it can't be done, are usually interrupted by those doing it.


--
Ing. Ivaldi Andres




--
Ing. Ivaldi Andres



--
Ing. Ivaldi Andres


RE: a question about web ui log

2016-01-26 Thread Mohammed Guller
If the application history is turned on, it should work, even through an ssh 
tunnel. Can you elaborate on what you mean by "it does not work"?

Also, are you able to see the application web UI while an application is 
executing a job?

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Philip Lee [mailto:philjj...@gmail.com]
Sent: Tuesday, January 26, 2016 5:12 AM
To: Mohammed Guller
Cc: user@spark.apache.org
Subject: Re: a question about web ui log

Yes, I tried it, but it simply does not work.

So my plan is to use an "ssh tunnel" to forward a port of the cluster to a 
localhost port.

But in the Spark UI there are two ports I should forward using the ssh tunnel. 
With the default ports, 8080 is the web UI port for getting into the web UI, and 
4040 is the monitoring port for seeing execution timing, like the DAG in the 
application details UI, as you probably know.

But after a job finishes, I can see the job listed in the web UI on 8080; 
however, when I click "application details UI" on port 4040 to see the execution 
timing, it does not work.

Any suggestions? I really need to see the execution of the DAG.

Best,
Phil

On Tue, Jan 26, 2016 at 12:04 AM, Mohammed Guller 
<moham...@glassbeam.com<mailto:moham...@glassbeam.com>> wrote:
I am not sure whether you can copy the log files from Spark workers to your 
local machine and view it from the Web UI. In fact, if you are able to copy the 
log files locally, you can just view them directly in any text editor.

I suspect what you really want to see is the application history. Here is the 
relevant information from Spark’s monitoring page 
(http://spark.apache.org/docs/latest/monitoring.html)

To view the web UI after the fact, set spark.eventLog.enabled to true before 
starting the application. This configures Spark to log Spark events that encode 
the information displayed in the UI to persisted storage.
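
For example, a minimal sketch of the relevant entries in conf/spark-defaults.conf (the log directory is just an example; it must exist and be readable by whatever serves the history):

spark.eventLog.enabled   true
spark.eventLog.dir       hdfs:///spark-event-logs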

Mohammed
Author: Big Data Analytics with 
Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Philip Lee [mailto:philjj...@gmail.com<mailto:philjj...@gmail.com>]
Sent: Monday, January 25, 2016 9:51 AM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: a question about web ui log

As I mentioned before, I am trying to see the Spark log on a cluster via an 
ssh tunnel.

1) The error on the application details UI is probably from the monitoring port 
4044. The web UI port is 8088, right? So how could I see the job web UI view and 
the application details UI view in the web UI on my local machine?

2) I am still wondering how to see the log after copying the log file to my local machine.

The error was mentioned in the previous mail.

Thanks,
Phil



On Mon, Jan 25, 2016 at 5:36 PM, Philip Lee 
<philjj...@gmail.com<mailto:philjj...@gmail.com>> wrote:
Hello, a question about the web UI log.

I could see the web interface log after forwarding the port on my cluster to my 
local machine and clicking a completed application, but when I clicked 
"application detail UI" I got the following:

[Inline image 1]

It happened to me. I do not know why. I also checked the specific log folder; it 
has a log file in it. Actually, that's why I could click the completed 
application link, right?

So is it okay for me to copy the log file from my cluster to my local machine?
And after turning on the Spark Job Manager on my local machine myself, could I 
see the application details UI locally?

Best,
Phil




RE: withColumn

2016-01-26 Thread Mohammed Guller
Naga –
I believe that the second argument to the withColumn method has to be a column 
calculated from the source DataFrame on which you call that method. The 
following will work:

df2.withColumn("age2", $"age"+10)
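
A minimal sketch of that, assuming df2 has a numeric "age" column:

// derive the new column from df2 itself rather than from another DataFrame
val df3 = df2.withColumn("age2", df2("age") + 10)
df3.show()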


Mohammed
Author: Big Data Analytics with 
Spark

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday, January 26, 2016 1:45 PM
To: naga sharathrayapati
Cc: user
Subject: Re: withColumn

A brief search among the Spark source code showed no support for referencing 
column the way shown in your code.

Are you trying to do a join ?

Cheers

On Tue, Jan 26, 2016 at 1:04 PM, naga sharathrayapati 
> wrote:

I was trying to append a column to a DataFrame df2 by using 'withColumn' (as 
shown below); can anyone help me understand what went wrong?



scala> case class Sharath(name1: String, age1: Long)

defined class Sharath

scala> val df1 = Seq(Sharath("Sharath", 29)).toDF()

df1: org.apache.spark.sql.DataFrame = [name1: string, age1: bigint]

scala> case class Sunil(name: String, age: Long)

defined class Sunil

scala> val df2 = Seq(Sunil("Sunil", 33)).toDF()

df2: org.apache.spark.sql.DataFrame = [name: string, age: bigint]

scala> df2.withColumn("agess",df1("name1"))

org.apache.spark.sql.AnalysisException: resolved attribute(s) name1#0 missing 
from name#2,age#3L in operator !Project [name#2,age#3L,name1#0 AS agess#4];

at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)

at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)



RE: a question about web ui log

2016-01-25 Thread Mohammed Guller
I am not sure whether you can copy the log files from Spark workers to your 
local machine and view it from the Web UI. In fact, if you are able to copy the 
log files locally, you can just view them directly in any text editor.

I suspect what you really want to see is the application history. Here is the 
relevant information from Spark’s monitoring page 
(http://spark.apache.org/docs/latest/monitoring.html)

To view the web UI after the fact, set spark.eventLog.enabled to true before 
starting the application. This configures Spark to log Spark events that encode 
the information displayed in the UI to persisted storage.

Mohammed
Author: Big Data Analytics with 
Spark

From: Philip Lee [mailto:philjj...@gmail.com]
Sent: Monday, January 25, 2016 9:51 AM
To: user@spark.apache.org
Subject: Re: a question about web ui log

As I mentioned before, I am trying to see the Spark log on a cluster via an 
ssh tunnel.

1) The error on the application details UI is probably from the monitoring port 
4044. The web UI port is 8088, right? So how could I see the job web UI view and 
the application details UI view in the web UI on my local machine?

2) I am still wondering how to see the log after copying the log file to my local machine.

The error was mentioned in the previous mail.

Thanks,
Phil



On Mon, Jan 25, 2016 at 5:36 PM, Philip Lee 
> wrote:
Hello, a question about the web UI log.

I could see the web interface log after forwarding the port on my cluster to my 
local machine and clicking a completed application, but when I clicked 
"application detail UI" I got the following:

[Inline image 1]

It happened to me. I do not know why. I also checked the specific log folder; it 
has a log file in it. Actually, that's why I could click the completed 
application link, right?

So is it okay for me to copy the log file from my cluster to my local machine?
And after turning on the Spark Job Manager on my local machine myself, could I 
see the application details UI locally?

Best,
Phil



RE: Spark Cassandra clusters

2016-01-22 Thread Mohammed Guller
Vivek,

By default, Cassandra uses ¼ of the system memory, so in your case, it will be 
around 8GB, which is fine.

If you have more Cassandra-related questions, it is better to post them on the 
Cassandra mailing list. Also feel free to email me directly.

Mohammed
Author: Big Data Analytics with 
Spark

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Friday, January 22, 2016 6:37 PM
To: vivek.meghanat...@wipro.com
Cc: user
Subject: Re: Spark Cassandra clusters

I am not Cassandra developer :-)

Can you use http://search-hadoop.com/ or ask on Cassandra mailing list.

Cheers

On Fri, Jan 22, 2016 at 6:35 PM, 
> wrote:

Thanks Ted, also what is the suggested memory setting for Cassandra process?

Regards
Vivek
On Sat, Jan 23, 2016 at 7:57 am, Ted Yu 
> wrote:

From your description, putting Cassandra daemon on Spark cluster should be 
feasible.

One aspect to be measured is how much locality can be achieved in this setup - 
Cassandra is distributed NoSQL store.

Cheers

On Fri, Jan 22, 2016 at 6:13 PM, 
> wrote:

+ spark standalone cluster
On Sat, Jan 23, 2016 at 7:33 am, Vivek Meghanathan (WT01 - NEP) 
> wrote:


We have the setup on Google Cloud Platform. Each node has 8 CPUs + 30GB memory: 
10 nodes for Spark and another 9 nodes for Cassandra.
We are using Spark 1.3.0 and the DataStax bundle 4.5.9 (which has 2.0.x Cassandra).
The Spark master and worker daemons use Xmx & Xms of 4G. We have not changed the 
default settings of Cassandra; should we be increasing the JVM memory?

We have 9 streaming jobs; the core usage varies from 2-6 and the memory usage from 
1-4 GB.

We have budget to use higher-CPU or higher-memory systems, hence we were planning 
to have them together on more powerful nodes.

Regards
Vivek
On Sat, Jan 23, 2016 at 7:13 am, Ted Yu 
> wrote:

Can you give us a bit more information ?

How much memory does each node have ?
What's the current heap allocation for Cassandra process and executor ?
Spark / Cassandra release you are using

Thanks

On Fri, Jan 22, 2016 at 5:37 PM, 
> wrote:

Hi All,
What is the right Spark-Cassandra cluster setup: having the Cassandra cluster and 
the Spark cluster on different nodes, or should they be on the same nodes?
We have them on different nodes, and performance tests show very bad results for 
the Spark streaming jobs.
Please let us know.

Regards
Vivek
The information contained in this electronic message and any attachments to 
this message are intended for the exclusive use of the addressee(s) and may 
contain proprietary, confidential or privileged information. If you are not the 
intended recipient, you should not disseminate, distribute or copy this e-mail. 
Please notify the sender immediately and destroy all copies of this message and 
any attachments. WARNING: Computer viruses can be transmitted via email. The 
recipient should check this email and any attachments for the presence of 
viruses. The company accepts no liability for any damage caused by any virus 
transmitted by this email. www.wipro.com





RE: Date / time stuff with spark.

2016-01-22 Thread Mohammed Guller
Hi Andrew,

Here is another option.

You can define custom schema to specify the correct type for the time column as 
shown below:

import org.apache.spark.sql.types._

val customSchema = StructType(
  StructField("a", IntegerType, false) ::
  StructField("b", LongType, false) ::
  StructField("Id", StringType, false) ::
  StructField("smunkId", StringType, false) ::
  StructField("popsicleRange", LongType, false) ::
  StructField("time", TimestampType, false) :: Nil
)

val df = sqlContext.read.schema(customSchema).json("...")

You can also use the built-in date/time functions to manipulate the time column 
as shown below.

val day = df.select(dayofmonth($"time"))
val mth = df.select(month($"time"))
val prevYear = df.select(year($"time") - 1)

You can get the complete list of the date/time functions here:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

Mohammed
Author: Big Data Analytics with 
Spark

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Friday, January 22, 2016 8:01 AM
To: Spencer, Alex (Santander)
Cc: Andrew Holway; user@spark.apache.org
Subject: Re: Date / time stuff with spark.

Related thread:

http://search-hadoop.com/m/q3RTtSfi342nveex1=RE+NPE+when+using+Joda+DateTime

FYI

On Fri, Jan 22, 2016 at 6:50 AM, Spencer, Alex (Santander) 
>
 wrote:
Hi Andy,

Sorry this is in Scala, but you may be able to do something similar? I use 
Joda's DateTime class. I ran into a lot of difficulties with the serializer, 
but if you are an admin on the box you'll have fewer issues by adding in some 
Kryo serializers.

import org.joda.time._

val dateFormat = format.DateTimeFormat.forPattern("yyyy-MM-dd")
val tranDate = dateFormat.parseDateTime(someDateString)


Alex

-Original Message-
From: Andrew Holway 
[mailto:andrew.hol...@otternetworks.de]
Sent: 21 January 2016 19:25
To: user@spark.apache.org
Subject: Date / time stuff with spark.

Hello,

I am importing this data from HDFS into a data frame with 
sqlContext.read.json().

{"a": 42, "b": 56, "Id": "621368e2f829f230", "smunkId":
"CKm26sDMucoCFReRGwodbHAAgw", "popsicleRange": "17610", "time":
"2016-01-20T23:59:53+00:00"}

I want to do some date/time operations on this JSON data, but I cannot find 
clear documentation on how to:

A) specify the "time" field as a date/time in the schema.
B) determine the format the date should be in within the raw data for an easy 
import.

Cheers,

Andrew

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org For 
additional commands, e-mail: 
user-h...@spark.apache.org

RE: Is it possible to use SparkSQL JDBC ThriftServer without Hive

2016-01-15 Thread Mohammed Guller
Sambit - I believe the default Derby-based metastore allows only one active 
user at a time. You can replace it with MySQL or Postgres. 

Using the Hive metastore enables Spark SQL to be compatible with Hive. If you 
have an existing Hive setup, you can use Spark SQL to process data in your Hive 
tables. 
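
For reference, a rough sketch of the hive-site.xml entries for a MySQL-backed metastore (the host, database name and credentials below are placeholders, not a tested configuration):

<property><name>javax.jdo.option.ConnectionURL</name><value>jdbc:mysql://metastore-host:3306/metastore?createDatabaseIfNotExist=true</value></property>
<property><name>javax.jdo.option.ConnectionDriverName</name><value>com.mysql.jdbc.Driver</value></property>
<property><name>javax.jdo.option.ConnectionUserName</name><value>hiveuser</value></property>
<property><name>javax.jdo.option.ConnectionPassword</name><value>hivepass</value></property>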

Mohammed

-Original Message-
From: Sambit Tripathy (RBEI/EDS1) [mailto:sambit.tripa...@in.bosch.com] 
Sent: Friday, January 15, 2016 11:30 AM
To: Mohammed Guller; angela.whelan; user@spark.apache.org
Subject: RE: Is it possible to use SparkSQL JDBC ThriftServer without Hive

Hi Mohammed,

I think this is something you can do at Thrift server startup. So this would run 
an instance of Derby and act as a metastore. Any idea whether this Derby 
metastore will allow distributed access, and why do we use the Hive metastore 
then?

@Angela: I would also be happy to have a metastore owned by the Spark Thrift 
Server. What are you trying to achieve by using the Thrift server without Hive?


Regards,
Sambit.


-Original Message-
From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Wednesday, January 13, 2016 2:54 PM
To: angela.whelan <angela.whe...@synchronoss.com>; user@spark.apache.org
Subject: RE: Is it possible to use SparkSQL JDBC ThriftServer without Hive

Hi Angela,
Yes, you can use Spark SQL JDBC/ThriftServer without Hive.

Mohammed


-Original Message-
From: angela.whelan [mailto:angela.whe...@synchronoss.com]
Sent: Wednesday, January 13, 2016 3:37 AM
To: user@spark.apache.org
Subject: Is it possible to use SparkSQL JDBC ThriftServer without Hive

hi,
I'm wondering if it is possible to use the SparkSQL JDBC ThriftServer without 
Hive?

The reason I'm asking is that we are unsure about the speed of Hive with 
SparkSQL JDBC connectivity.

I can't find any article online about using SparkSQL JDBC ThriftServer without 
Hive.

Many thanks in advance for any help on this.

Thanks, Angela



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-use-SparkSQL-JDBC-ThriftServer-without-Hive-tp25959.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Is it possible to use SparkSQL JDBC ThriftServer without Hive

2016-01-13 Thread Mohammed Guller
Hi Angela,
Yes, you can use Spark SQL JDBC/ThriftServer without Hive.

Mohammed


-Original Message-
From: angela.whelan [mailto:angela.whe...@synchronoss.com] 
Sent: Wednesday, January 13, 2016 3:37 AM
To: user@spark.apache.org
Subject: Is it possible to use SparkSQL JDBC ThriftServer without Hive

hi,
I'm wondering if it is possible to use the SparkSQL JDBC ThriftServer without 
Hive?

The reason I'm asking is that we are unsure about the speed of Hive with 
SparkSQL JDBC connectivity.

I can't find any article online about using SparkSQL JDBC ThriftServer without 
Hive.

Many thanks in advance for any help on this.

Thanks, Angela



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-use-SparkSQL-JDBC-ThriftServer-without-Hive-tp25959.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Spark ignores SPARK_WORKER_MEMORY?

2016-01-13 Thread Mohammed Guller
Barak,

The SPARK_WORKER_MEMORY setting is used for allocating memory to executors.

You can use SPARK_DAEMON_MEMORY to set memory for the worker JVM.
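
For example, a sketch of conf/spark-env.sh (the values are placeholders, not recommendations):

export SPARK_WORKER_MEMORY=16g   # total memory the worker can hand out to executors
export SPARK_DAEMON_MEMORY=2g    # heap for the worker and master daemon JVMs themselves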

Mohammed

From: Barak Yaish [mailto:barak.ya...@gmail.com]
Sent: Wednesday, January 13, 2016 12:59 AM
To: user@spark.apache.org
Subject: Spark ignores SPARK_WORKER_MEMORY?

Hello,

Although I'm setting SPARK_WORKER_MEMORY in spark-env.sh, it looks like this 
setting is ignored. I can't find any indication in the scripts under bin/sbin 
that -Xms/-Xmx are set.

If I ps the worker pid, it looks like the memory is set to 1G:

[hadoop@sl-env1-hadoop1 spark-1.5.2-bin-hadoop2.6]$ ps -ef | grep 20232
hadoop   20232 1  0 02:01 ?00:00:22 /usr/java/latest//bin/java -cp 
/workspace/3rd-party/spark/spark-1.5.2-bin-hadoop2.6/sbin/../conf/:/workspace/3rd-party/spark/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar:/workspace/3rd-party/spark/spark-1.5.2-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/workspace/3rd-party/spark/spark-1.5.2-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/workspace/3rd-party/spark/spark-1.5.2-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/workspace/3rd-party/hadoop/2.6.3//etc/hadoop/
 -Xms1g -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 
spark://10.52.39.92:7077

Am I missing something?

Thanks.


RE: spark job failure - akka error Association with remote system has failed

2016-01-13 Thread Mohammed Guller
Check the entries in your /etc/hosts file.

Also check what the hostname command returns.
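
For example, something along these lines on the master node (a quick sanity check, nothing more):

hostname                     # the name the other nodes use to reach this machine
getent hosts $(hostname)     # verify it resolves to the intended IP
cat /etc/hosts               # look for stale or conflicting entries after the reboot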

Mohammed

From: vivek.meghanat...@wipro.com [mailto:vivek.meghanat...@wipro.com]
Sent: Tuesday, January 12, 2016 11:36 PM
To: user@spark.apache.org
Subject: RE: spark job failure - akka error Association with remote system has 
failed

I have used the IP address as master_ip, and the Spark conf also has the IP 
address. But the following log shows the hostname. (The Spark UI shows the master 
details by IP.)


16/01/13 12:31:38 WARN ReliableDeliverySupervisor: Association with remote 
system [akka.tcp://sparkDriver@masternode1:36537] has failed, address is now 
gated for [5000] ms. Reason is: [Disassociated].

From: Vivek Meghanathan (WT01 - NEP)
Sent: 13 January 2016 12:18
To: user@spark.apache.org
Subject: spark job failure - akka error Association with remote system has 
failed

Hi All,
I am running Spark 1.3.0 in standalone cluster mode. We have rebooted the cluster 
servers (system reboot). After that, the Spark jobs are failing with the 
following error (each fails within 7-8 seconds). 2 of the jobs are running fine. 
All the jobs used to be stable before the system reboot. We have not changed any 
of the default configurations other than spark-env.sh, slaves and 
log4j.properties.

Warning in the master log:

16/01/13 11:58:16 WARN ReliableDeliverySupervisor: Association with remote 
system [akka.tcp://sparkDriver@masternode1:41419] has failed, address is now 
gated for [5000] ms. Reason is: [Disassociated].

Regards,
Vivek M
The information contained in this electronic message and any attachments to 
this message are intended for the exclusive use of the addressee(s) and may 
contain proprietary, confidential or privileged information. If you are not the 
intended recipient, you should not disseminate, distribute or copy this e-mail. 
Please notify the sender immediately and destroy all copies of this message and 
any attachments. WARNING: Computer viruses can be transmitted via email. The 
recipient should check this email and any attachments for the presence of 
viruses. The company accepts no liability for any damage caused by any virus 
transmitted by this email. www.wipro.com


RE: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Mohammed Guller
Did you mean Hive or Spark SQL JDBC/ODBC server?

Mohammed

From: Bryan Jeffrey [mailto:bryan.jeff...@gmail.com]
Sent: Thursday, November 12, 2015 9:12 AM
To: Mohammed Guller
Cc: user
Subject: Re: Cassandra via SparkSQL/Hive JDBC

Mohammed,

That is great.  It looks like a perfect scenario. Would I be able to make the 
created DF queryable over the Hive JDBC/ODBC server?

Regards,

Bryan Jeffrey

On Wed, Nov 11, 2015 at 9:34 PM, Mohammed Guller 
<moham...@glassbeam.com<mailto:moham...@glassbeam.com>> wrote:
Short answer: yes.

The Spark Cassandra Connector supports the data source API. So you can create a 
DataFrame that points directly to a Cassandra table. You can query it using the 
DataFrame API or the SQL/HiveQL interface.

If you want to see an example,  see slide# 27 and 28 in this deck that I 
presented at the Cassandra Summit 2015:
http://www.slideshare.net/mg007/ad-hoc-analytics-with-cassandra-and-spark


Mohammed

From: Bryan [mailto:bryan.jeff...@gmail.com<mailto:bryan.jeff...@gmail.com>]
Sent: Tuesday, November 10, 2015 7:42 PM
To: Bryan Jeffrey; user
Subject: RE: Cassandra via SparkSQL/Hive JDBC

Anyone have thoughts or a similar use-case for SparkSQL / Cassandra?

Regards,

Bryan Jeffrey

From: Bryan Jeffrey<mailto:bryan.jeff...@gmail.com>
Sent: ‎11/‎4/‎2015 11:16 AM
To: user<mailto:user@spark.apache.org>
Subject: Cassandra via SparkSQL/Hive JDBC
Hello.

I have been working to add SparkSQL HDFS support to our application.  We're 
able to process streaming data, append to a persistent Hive table, and have 
that table available via JDBC/ODBC.  Now we're looking to access data in 
Cassandra via SparkSQL.

In reading a number of previous posts, it appears that the way to do this is to 
instantiate a Spark Context, read the data into an RDD using the Cassandra 
Spark Connector, convert the data to a DF and register it as a temporary table. 
 The data will then be accessible via SparkSQL - although I assume that you 
would need to refresh the table on a periodic basis.

Is there a more straightforward way to do this?  Is it possible to register the 
Cassandra table with Hive so that the SparkSQL thrift server instance can just 
read data directly?

Regards,

Bryan Jeffrey



RE: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Mohammed Guller
Hi Bryan,

Yes, you can query a real Cassandra cluster. You just need to provide the 
address of the Cassandra seed node.

Looks like you figured out the answer. You can also put the C* seed node 
address in the spark-defaults.conf file under the SPARK_HOME/conf directory. 
Then you don’t need to manually SET it for each Beeline session.
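
For example (the address is the one from this thread):

spark.cassandra.connection.host  10.0.0.10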

Mohammed

From: Bryan Jeffrey [mailto:bryan.jeff...@gmail.com]
Sent: Thursday, November 12, 2015 10:26 AM
To: Mohammed Guller
Cc: user
Subject: Re: Cassandra via SparkSQL/Hive JDBC

Answer: In beeline run the following: SET 
spark.cassandra.connection.host="10.0.0.10"

On Thu, Nov 12, 2015 at 1:13 PM, Bryan Jeffrey 
<bryan.jeff...@gmail.com<mailto:bryan.jeff...@gmail.com>> wrote:
Mohammed,

While you're willing to answer questions, is there a trick to getting the Hive 
Thrift server to connect to remote Cassandra instances?

0: jdbc:hive2://localhost:1> SET 
spark.cassandra.connection.host="cassandrahost";
SET spark.cassandra.connection.host="cassandrahost";
+---------------------------------------------------+
|                                                   |
+---------------------------------------------------+
| spark.cassandra.connection.host="cassandrahost"   |
+---------------------------------------------------+
1 row selected (0.018 seconds)
0: jdbc:hive2://localhost:1> create temporary table cdr using 
org.apache.spark.sql.cassandra OPTIONS ( keyspace "c2", table "detectionresult" 
);
create temporary table cdr using org.apache.spark.sql.cassandra OPTIONS ( 
keyspace "c2", table "detectionresult" );
]Error: java.io.IOException: Failed to open native connection to Cassandra at 
{10.0.0.4}:9042 (state=,code=0)

This seems to be connecting to local host regardless of the value I set 
spark.cassandra.connection.host to.

Regards,

Bryan Jeffrey

On Thu, Nov 12, 2015 at 12:54 PM, Bryan Jeffrey 
<bryan.jeff...@gmail.com<mailto:bryan.jeff...@gmail.com>> wrote:
Yes, I do - I found your example of doing that later in your slides.  Thank you 
for your help!

On Thu, Nov 12, 2015 at 12:20 PM, Mohammed Guller 
<moham...@glassbeam.com<mailto:moham...@glassbeam.com>> wrote:
Did you mean Hive or Spark SQL JDBC/ODBC server?

Mohammed

From: Bryan Jeffrey 
[mailto:bryan.jeff...@gmail.com<mailto:bryan.jeff...@gmail.com>]
Sent: Thursday, November 12, 2015 9:12 AM
To: Mohammed Guller
Cc: user
Subject: Re: Cassandra via SparkSQL/Hive JDBC

Mohammed,

That is great.  It looks like a perfect scenario. Would I be able to make the 
created DF queryable over the Hive JDBC/ODBC server?

Regards,

Bryan Jeffrey

On Wed, Nov 11, 2015 at 9:34 PM, Mohammed Guller 
<moham...@glassbeam.com<mailto:moham...@glassbeam.com>> wrote:
Short answer: yes.

The Spark Cassandra Connector supports the data source API. So you can create a 
DataFrame that points directly to a Cassandra table. You can query it using the 
DataFrame API or the SQL/HiveQL interface.

If you want to see an example,  see slide# 27 and 28 in this deck that I 
presented at the Cassandra Summit 2015:
http://www.slideshare.net/mg007/ad-hoc-analytics-with-cassandra-and-spark


Mohammed

From: Bryan [mailto:bryan.jeff...@gmail.com<mailto:bryan.jeff...@gmail.com>]
Sent: Tuesday, November 10, 2015 7:42 PM
To: Bryan Jeffrey; user
Subject: RE: Cassandra via SparkSQL/Hive JDBC

Anyone have thoughts or a similar use-case for SparkSQL / Cassandra?

Regards,

Bryan Jeffrey

From: Bryan Jeffrey<mailto:bryan.jeff...@gmail.com>
Sent: ‎11/‎4/‎2015 11:16 AM
To: user<mailto:user@spark.apache.org>
Subject: Cassandra via SparkSQL/Hive JDBC
Hello.

I have been working to add SparkSQL HDFS support to our application.  We're 
able to process streaming data, append to a persistent Hive table, and have 
that table available via JDBC/ODBC.  Now we're looking to access data in 
Cassandra via SparkSQL.

In reading a number of previous posts, it appears that the way to do this is to 
instantiate a Spark Context, read the data into an RDD using the Cassandra 
Spark Connector, convert the data to a DF and register it as a temporary table. 
 The data will then be accessible via SparkSQL - although I assume that you 
would need to refresh the table on a periodic basis.

Is there a more straightforward way to do this?  Is it possible to register the 
Cassandra table with Hive so that the SparkSQL thrift server instance can just 
read data directly?

Regards,

Bryan Jeffrey






RE: Cassandra via SparkSQL/Hive JDBC

2015-11-11 Thread Mohammed Guller
Short answer: yes.

The Spark Cassandra Connector supports the data source API. So you can create a 
DataFrame that points directly to a Cassandra table. You can query it using the 
DataFrame API or the SQL/HiveQL interface.

If you want to see an example,  see slide# 27 and 28 in this deck that I 
presented at the Cassandra Summit 2015:
http://www.slideshare.net/mg007/ad-hoc-analytics-with-cassandra-and-spark
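
As a rough sketch of that approach (the keyspace and table names are placeholders; it assumes the Spark Cassandra Connector is on the classpath):

// point a DataFrame directly at a Cassandra table via the data source API
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
  .load()
df.registerTempTable("my_table")
sqlContext.sql("SELECT count(*) FROM my_table").show()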


Mohammed

From: Bryan [mailto:bryan.jeff...@gmail.com]
Sent: Tuesday, November 10, 2015 7:42 PM
To: Bryan Jeffrey; user
Subject: RE: Cassandra via SparkSQL/Hive JDBC

Anyone have thoughts or a similar use-case for SparkSQL / Cassandra?

Regards,

Bryan Jeffrey

From: Bryan Jeffrey
Sent: ‎11/‎4/‎2015 11:16 AM
To: user
Subject: Cassandra via SparkSQL/Hive JDBC
Hello.

I have been working to add SparkSQL HDFS support to our application.  We're 
able to process streaming data, append to a persistent Hive table, and have 
that table available via JDBC/ODBC.  Now we're looking to access data in 
Cassandra via SparkSQL.

In reading a number of previous posts, it appears that the way to do this is to 
instantiate a Spark Context, read the data into an RDD using the Cassandra 
Spark Connector, convert the data to a DF and register it as a temporary table. 
 The data will then be accessible via SparkSQL - although I assume that you 
would need to refresh the table on a periodic basis.

Is there a more straightforward way to do this?  Is it possible to register the 
Cassandra table with Hive so that the SparkSQL thrift server instance can just 
read data directly?

Regards,

Bryan Jeffrey


RE: Spark SQL Thriftserver and Hive UDF in Production

2015-10-18 Thread Mohammed Guller
Have you tried registering the function using the Beeline client?

Another alternative would be to create a Spark SQL UDF and launch the Spark SQL 
Thrift server programmatically.
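
A rough sketch of the programmatic route (the UDF name and logic are made up; it assumes an existing SparkContext named sc):

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(sc)
hiveContext.udf.register("myfunc", (s: String) => s.toUpperCase)  // hypothetical UDF
HiveThriftServer2.startWithContext(hiveContext)  // JDBC/ODBC clients share this context and see myfunc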

Mohammed

-Original Message-
From: ReeceRobinson [mailto:re...@therobinsons.gen.nz] 
Sent: Sunday, October 18, 2015 8:05 PM
To: user@spark.apache.org
Subject: Spark SQL Thriftserver and Hive UDF in Production

Does anyone have advice on the best way to deploy a Hive UDF for use with a 
Spark SQL Thrift server where the client is Tableau using the Simba ODBC Spark SQL 
driver?

I have seen the hive documentation that provides an example of creating the 
function using a hive client ie: CREATE FUNCTION myfunc AS 'myclass' USING JAR 
'hdfs:///path/to/jar';

However using Tableau I can't run this create function statement to register my 
UDF. Ideally there is a configuration setting that will load my UDF jar and 
register it at start-up of the thriftserver.

Can anyone tell me what the best option is, if it is possible at all?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Thriftserver-and-Hive-UDF-in-Production-tp25114.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: dataframes and numPartitions

2015-10-15 Thread Mohammed Guller
You may find the spark.sql.shuffle.partitions property useful. The default 
value is 200.
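
For example, it can be changed per SQLContext (the value 64 is arbitrary):

// lower the number of post-shuffle partitions for this SQLContext
sqlContext.setConf("spark.sql.shuffle.partitions", "64")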

Mohammed

From: Alex Nastetsky [mailto:alex.nastet...@vervemobile.com]
Sent: Wednesday, October 14, 2015 8:14 PM
To: user
Subject: dataframes and numPartitions

A lot of RDD methods take a numPartitions parameter that lets you specify the 
number of partitions in the result. For example, groupByKey.

The DataFrame counterparts don't have a numPartitions parameter, e.g. groupBy 
only takes a bunch of Columns as params.

I understand that the DataFrame API is supposed to be smarter and go through a 
LogicalPlan, and perhaps determine the number of optimal partitions for you, 
but sometimes you want to specify the number of partitions yourself. One such 
use case is when you are preparing to do a "merge" join with another dataset 
that is similarly partitioned with the same number of partitions.


RE: laziness in textFile reading from HDFS?

2015-10-06 Thread Mohammed Guller
I have not used LZO compressed files from Spark, so not sure why it stalls 
without caching. 

In general, if you are going to make just one pass over the data, there is not 
much benefit in caching it. The data gets read anyway only after the first 
action is called. If you are calling just a map operation and then a save 
operation, I don't see how caching would help.

Mohammed


-Original Message-
From: Matt Narrell [mailto:matt.narr...@gmail.com] 
Sent: Tuesday, October 6, 2015 3:32 PM
To: Mohammed Guller
Cc: davidkl; user@spark.apache.org
Subject: Re: laziness in textFile reading from HDFS?

One.

I read in LZO compressed files from HDFS, perform a map operation, cache the 
results of this map operation, and call saveAsHadoopFile to write LZO back to HDFS.

Without the cache, the job will stall.  

mn

> On Oct 5, 2015, at 7:25 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
> 
> Is there any specific reason for caching the RDD? How many passes you make 
> over the dataset? 
> 
> Mohammed
> 
> -Original Message-
> From: Matt Narrell [mailto:matt.narr...@gmail.com]
> Sent: Saturday, October 3, 2015 9:50 PM
> To: Mohammed Guller
> Cc: davidkl; user@spark.apache.org
> Subject: Re: laziness in textFile reading from HDFS?
> 
> Is there any more information or best practices here?  I have the exact same 
> issues when reading large data sets from HDFS (larger than available RAM) and 
> I cannot run without setting the RDD persistence level to 
> MEMORY_AND_DISK_SER, and using nearly all the cluster resources.
> 
> Should I repartition this RDD to be equal to the number of cores?  
> 
> I notice that the job duration on the YARN UI is about 30 minutes longer than 
> the Spark UI.  When the job initially starts, there is no tasks shown in the 
> Spark UI..?
> 
> All I;m doing is reading records from HDFS text files with sc.textFile, and 
> rewriting them back to HDFS grouped by a timestamp.
> 
> Thanks,
> mn
> 
>> On Sep 29, 2015, at 8:06 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
>> 
>> 1) It is not required to have the same amount of memory as data. 
>> 2) By default the # of partitions are equal to the number of HDFS 
>> blocks
>> 3) Yes, the read operation is lazy
>> 4) It is okay to have more number of partitions than number of cores. 
>> 
>> Mohammed
>> 
>> -Original Message-
>> From: davidkl [mailto:davidkl...@hotmail.com]
>> Sent: Monday, September 28, 2015 1:40 AM
>> To: user@spark.apache.org
>> Subject: laziness in textFile reading from HDFS?
>> 
>> Hello,
>> 
>> I need to process a significant amount of data every day, about 4TB. This 
>> will be processed in batches of about 140GB. The cluster this will be 
>> running on doesn't have enough memory to hold the dataset at once, so I am 
>> trying to understand how this works internally.
>> 
>> When using textFile to read an HDFS folder (containing multiple files), I 
>> understand that the number of partitions created are equal to the number of 
>> HDFS blocks, correct? Are those created in a lazy way? I mean, if the number 
>> of blocks/partitions is larger than the number of cores/threads the Spark 
>> driver was launched with (N), are N partitions created initially and then 
>> the rest when required? Or are all those partitions created up front?
>> 
>> I want to avoid reading the whole data into memory just to spill it out to 
>> disk if there is no enough memory.
>> 
>> Thanks! 
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/laziness-in-textF
>> i le-reading-from-HDFS-tp24837.html Sent from the Apache Spark User 
>> List mailing list archive at Nabble.com.
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For 
>> additional commands, e-mail: user-h...@spark.apache.org
>> 
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For 
>> additional commands, e-mail: user-h...@spark.apache.org
>> 
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: laziness in textFile reading from HDFS?

2015-10-06 Thread Mohammed Guller
It is not uncommon to process datasets larger than available memory with Spark. 

I don't remember whether LZO files are splittable. Perhaps, in your case Spark 
is running into issues while decompressing a large LZO file.

See if this helps:
http://stackoverflow.com/questions/25248170/spark-hadoop-throws-exception-for-large-lzo-files


Mohammed


-Original Message-
From: Matt Narrell [mailto:matt.narr...@gmail.com] 
Sent: Tuesday, October 6, 2015 4:08 PM
To: Mohammed Guller
Cc: davidkl; user@spark.apache.org
Subject: Re: laziness in textFile reading from HDFS?

Agreed. This is Spark 1.2 on CDH 5.x. How do you mitigate cases where the data 
sets are larger than available memory?

My jobs stall, and I get GC/heap issues all over the place.  

..via mobile

> On Oct 6, 2015, at 4:44 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
> 
> I have not used LZO compressed files from Spark, so not sure why it stalls 
> without caching. 
> 
> In general, if you are going to make just one pass over the data, there is 
> not much benefit in caching it. The data gets read anyway only after the 
> first action is called. If you are calling just a map operation and then a 
> save operation, I don't see how caching would help.
> 
> Mohammed
> 
> 
> -Original Message-
> From: Matt Narrell [mailto:matt.narr...@gmail.com]
> Sent: Tuesday, October 6, 2015 3:32 PM
> To: Mohammed Guller
> Cc: davidkl; user@spark.apache.org
> Subject: Re: laziness in textFile reading from HDFS?
> 
> One.
> 
> I read in LZO compressed files from HDFS, perform a map operation, cache the 
> results of this map operation, and call saveAsHadoopFile to write LZO back to HDFS.
> 
> Without the cache, the job will stall.  
> 
> mn
> 
>> On Oct 5, 2015, at 7:25 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
>> 
>> Is there any specific reason for caching the RDD? How many passes you make 
>> over the dataset? 
>> 
>> Mohammed
>> 
>> -Original Message-
>> From: Matt Narrell [mailto:matt.narr...@gmail.com]
>> Sent: Saturday, October 3, 2015 9:50 PM
>> To: Mohammed Guller
>> Cc: davidkl; user@spark.apache.org
>> Subject: Re: laziness in textFile reading from HDFS?
>> 
>> Is there any more information or best practices here?  I have the exact same 
>> issues when reading large data sets from HDFS (larger than available RAM) 
>> and I cannot run without setting the RDD persistence level to 
>> MEMORY_AND_DISK_SER, and using nearly all the cluster resources.
>> 
>> Should I repartition this RDD to be equal to the number of cores?  
>> 
>> I notice that the job duration on the YARN UI is about 30 minutes longer 
>> than the Spark UI.  When the job initially starts, there is no tasks shown 
>> in the Spark UI..?
>> 
>> All I;m doing is reading records from HDFS text files with sc.textFile, and 
>> rewriting them back to HDFS grouped by a timestamp.
>> 
>> Thanks,
>> mn
>> 
>>> On Sep 29, 2015, at 8:06 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
>>> 
>>> 1) It is not required to have the same amount of memory as data. 
>>> 2) By default the # of partitions are equal to the number of HDFS 
>>> blocks
>>> 3) Yes, the read operation is lazy
>>> 4) It is okay to have more number of partitions than number of cores. 
>>> 
>>> Mohammed
>>> 
>>> -Original Message-
>>> From: davidkl [mailto:davidkl...@hotmail.com]
>>> Sent: Monday, September 28, 2015 1:40 AM
>>> To: user@spark.apache.org
>>> Subject: laziness in textFile reading from HDFS?
>>> 
>>> Hello,
>>> 
>>> I need to process a significant amount of data every day, about 4TB. This 
>>> will be processed in batches of about 140GB. The cluster this will be 
>>> running on doesn't have enough memory to hold the dataset at once, so I am 
>>> trying to understand how this works internally.
>>> 
>>> When using textFile to read an HDFS folder (containing multiple files), I 
>>> understand that the number of partitions created are equal to the number of 
>>> HDFS blocks, correct? Are those created in a lazy way? I mean, if the 
>>> number of blocks/partitions is larger than the number of cores/threads the 
>>> Spark driver was launched with (N), are N partitions created initially and 
>>> then the rest when required? Or are all those partitions created up front?
>>> 
>>> I want to avoid reading the whole data into memory just to spill it out to 
>>> disk if there is no enough memory.
>>> 
>>> Thanks! 
>&g

RE: laziness in textFile reading from HDFS?

2015-10-05 Thread Mohammed Guller
Is there any specific reason for caching the RDD? How many passes do you make 
over the dataset? 

Mohammed

-Original Message-
From: Matt Narrell [mailto:matt.narr...@gmail.com] 
Sent: Saturday, October 3, 2015 9:50 PM
To: Mohammed Guller
Cc: davidkl; user@spark.apache.org
Subject: Re: laziness in textFile reading from HDFS?

Is there any more information or best practices here?  I have the exact same 
issues when reading large data sets from HDFS (larger than available RAM) and I 
cannot run without setting the RDD persistence level to MEMORY_AND_DISK_SER, 
and using nearly all the cluster resources.

Should I repartition this RDD to be equal to the number of cores?  

I notice that the job duration on the YARN UI is about 30 minutes longer than on 
the Spark UI. When the job initially starts, there are no tasks shown in the 
Spark UI..?

All I'm doing is reading records from HDFS text files with sc.textFile, and 
rewriting them back to HDFS grouped by a timestamp.

Thanks,
mn

> On Sep 29, 2015, at 8:06 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
> 
> 1) It is not required to have the same amount of memory as data. 
> 2) By default the # of partitions are equal to the number of HDFS 
> blocks
> 3) Yes, the read operation is lazy
> 4) It is okay to have more number of partitions than number of cores. 
> 
> Mohammed
> 
> -Original Message-
> From: davidkl [mailto:davidkl...@hotmail.com]
> Sent: Monday, September 28, 2015 1:40 AM
> To: user@spark.apache.org
> Subject: laziness in textFile reading from HDFS?
> 
> Hello,
> 
> I need to process a significant amount of data every day, about 4TB. This 
> will be processed in batches of about 140GB. The cluster this will be running 
> on doesn't have enough memory to hold the dataset at once, so I am trying to 
> understand how this works internally.
> 
> When using textFile to read an HDFS folder (containing multiple files), I 
> understand that the number of partitions created are equal to the number of 
> HDFS blocks, correct? Are those created in a lazy way? I mean, if the number 
> of blocks/partitions is larger than the number of cores/threads the Spark 
> driver was launched with (N), are N partitions created initially and then the 
> rest when required? Or are all those partitions created up front?
> 
> I want to avoid reading the whole data into memory just to spill it out to 
> disk if there is not enough memory.
> 
> Thanks! 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/laziness-in-textFi
> le-reading-from-HDFS-tp24837.html Sent from the Apache Spark User List 
> mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For 
> additional commands, e-mail: user-h...@spark.apache.org
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For 
> additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Spark thrift service and Hive impersonation.

2015-09-29 Thread Mohammed Guller
When a user issues a connect command from Beeline, it asks for username and 
password. What happens if you give spark as the user name?

Also, it looks like the permission for "/data/mytable" is drwxr-x--x.

Have you tried changing the permission to allow other users to read?

Mohammed

From: Jagat Singh [mailto:jagatsi...@gmail.com]
Sent: Tuesday, September 29, 2015 6:32 PM
To: Mohammed Guller
Cc: SparkUser
Subject: Re: Spark thrift service and Hive impersonation.

Hi,

Thanks for your reply.

If you see the log message

Error: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table 
mytable. java.security.AccessControlException: Permission denied: user=spark, 
access=READ, inode="/data/mytable":tdcprdr:tdcprdr:drwxr-x--x

Spark is trying to read as the spark user, the user with which we started the Thrift Server.

Since the spark user does not have actual read access, we get the error.

However, Beeline is used by the end user, not the spark user, and so the query throws the error.

Thanks,

Jagat Singh



On Wed, Sep 30, 2015 at 11:24 AM, Mohammed Guller 
<moham...@glassbeam.com<mailto:moham...@glassbeam.com>> wrote:
Does each user need to start their own Thrift Server to use it?

No. One of the benefits of the Spark Thrift Server is that it allows multiple 
users to share a single SparkContext.

Most likely, you have a file permissions issue.

Mohammed

From: Jagat Singh [mailto:jagatsi...@gmail.com<mailto:jagatsi...@gmail.com>]
Sent: Tuesday, September 29, 2015 5:30 PM
To: SparkUser
Subject: Spark thrift service and Hive impersonation.

Hi,

I have started the Spark thrift service using spark user.

Does each user need to start their own Thrift Server to use it?

Using Beeline I am able to connect to the server and execute show tables;

However, when we try to execute a real query, it runs as the spark user and HDFS 
permissions do not allow the data to be read.

The query fails with error

0: jdbc:hive2://localhost:1> select count(*) from mytable;
Error: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table 
mytable. java.security.AccessControlException: Permission denied: user=spark, 
access=READ, inode="/data/mytable":tdcprdr:tdcprdr:drwxr-x--x
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)


And in thrift server we get log.


In the hive-site.xml we have impersonation enabled.

   
<property>
  <name>hive.server2.enable.doAs</name>
  <value>true</value>
</property>

<property>
  <name>hive.server2.enable.impersonation</name>
  <value>true</value>
</property>

Is there any other configuration needed for it to work like the normal Hive 
Thrift Server?

Thanks



RE: laziness in textFile reading from HDFS?

2015-09-29 Thread Mohammed Guller
1) It is not required to have the same amount of memory as data.
2) By default, the number of partitions is equal to the number of HDFS blocks.
3) Yes, the read operation is lazy.
4) It is okay to have more partitions than cores.
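A small sketch (path and numbers are illustrative, not from this thread) tying these points together: the read is lazy, the partition count follows the HDFS blocks unless you ask for a larger minimum, and spilling to disk is something you opt into with persist:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("LazyTextFile"))

// Nothing is read yet; minPartitions only requests at least 400 splits.
val lines = sc.textFile("hdfs:///data/batch", minPartitions = 400)

// Optional: let partitions that do not fit in memory spill to disk.
lines.persist(StorageLevel.MEMORY_AND_DISK_SER)

// The actual read happens here, one partition per task.
println(lines.count())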

Mohammed

-Original Message-
From: davidkl [mailto:davidkl...@hotmail.com] 
Sent: Monday, September 28, 2015 1:40 AM
To: user@spark.apache.org
Subject: laziness in textFile reading from HDFS?

Hello,

I need to process a significant amount of data every day, about 4TB. This will 
be processed in batches of about 140GB. The cluster this will be running on 
doesn't have enough memory to hold the dataset at once, so I am trying to 
understand how this works internally.

When using textFile to read an HDFS folder (containing multiple files), I 
understand that the number of partitions created is equal to the number of 
HDFS blocks, correct? Are those created in a lazy way? I mean, if the number of 
blocks/partitions is larger than the number of cores/threads the Spark driver 
was launched with (N), are N partitions created initially and then the rest 
when required? Or are all those partitions created up front?

I want to avoid reading the whole data into memory just to spill it out to disk 
if there is not enough memory.

Thanks! 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/laziness-in-textFile-reading-from-HDFS-tp24837.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Spark thrift service and Hive impersonation.

2015-09-29 Thread Mohammed Guller
Does each user need to start their own Thrift Server to use it?

No. One of the benefits of the Spark Thrift Server is that it allows multiple 
users to share a single SparkContext.

Most likely, you have a file permissions issue.

Mohammed

From: Jagat Singh [mailto:jagatsi...@gmail.com]
Sent: Tuesday, September 29, 2015 5:30 PM
To: SparkUser
Subject: Spark thrift service and Hive impersonation.

Hi,

I have started the Spark thrift service using spark user.

Does each user need to start their own Thrift Server to use it?

Using Beeline I am able to connect to the server and execute show tables;

However, when we try to execute a real query, it runs as the spark user and HDFS 
permissions do not allow the data to be read.

The query fails with error

0: jdbc:hive2://localhost:1> select count(*) from mytable;
Error: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table 
mytable. java.security.AccessControlException: Permission denied: user=spark, 
access=READ, inode="/data/mytable":tdcprdr:tdcprdr:drwxr-x--x
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)


And in thrift server we get log.


In the hive-site.xml we have impersonation enabled.

   
<property>
  <name>hive.server2.enable.doAs</name>
  <value>true</value>
</property>

<property>
  <name>hive.server2.enable.impersonation</name>
  <value>true</value>
</property>

Is there any other configuration needed for it to work like the normal Hive 
Thrift Server?

Thanks


RE: Cassandra Summit 2015 Roll Call!

2015-09-22 Thread Mohammed Guller
Hey everyone,
I will be at the summit too on Wed and Thu.  I am giving a talk on Thursday at 
2.40pm.

Would love to meet everyone on this list in person.  Here is an old picture of 
mine:
https://events.mfactormeetings.com/accounts/register123/mfactor/datastax/events/dstaxsummit2015/guller.jpg

Mohammed

From: Carlos Alonso [mailto:i...@mrcalonso.com]
Sent: Tuesday, September 22, 2015 5:23 PM
To: user@cassandra.apache.org
Subject: Re: Cassandra Summit 2015 Roll Call!

Hi guys.

I'm already here and I'll be the whole Summit. I'll be doing a live demo on 
Thursday on troubleshooting Cassandra production issues as a developer.

This is me!! https://twitter.com/calonso/status/646352711454097408

Carlos Alonso | Software Engineer | @calonso

On 22 September 2015 at 15:27, Jeff Jirsa wrote:
I’m here. Will be speaking Wednesday on DTCS for time series workloads: 
http://cassandrasummit-datastax.com/agenda/real-world-dtcs-for-operators/

Picture if you recognize me, say hi: 
https://events.mfactormeetings.com/accounts/register123/mfactor/datastax/events/dstaxsummit2015/jirsa.jpg
 (probably wearing glasses and carrying a black Crowdstrike backpack)

- Jeff


From: Robert Coli
Reply-To: "user@cassandra.apache.org"
Date: Tuesday, September 22, 2015 at 11:27 AM
To: "user@cassandra.apache.org"
Subject: Cassandra Summit 2015 Roll Call!

Cassandra Summit 2015 is upon us!

Every year, the conference gets bigger and bigger, and the chance of IRL 
meeting people you've "met" online gets smaller and smaller.

To improve everyone's chances, if you are attending the summit :

1) respond on-thread with a brief introduction (and physical description of 
yourself if you want others to be able to spot you!)
2) join #cassandra on freenode IRC (irc.freenode.org) 
to chat and connect with other attendees!

MY CONTRIBUTION :
--
I will be at the summit on Wednesday and Thursday. I am 5'8" or so, and will be 
wearing glasses and either a red or blue "Eventbrite Engineering" t-shirt with 
a graphic logo of gears on it. Come say hello! :D

=Rob




RE: Combining Spark Files with saveAsTextFile

2015-08-04 Thread Mohammed Guller
One option is to use the coalesce method in the RDD class.

Mohammed

From: Brandon White [mailto:bwwintheho...@gmail.com]
Sent: Tuesday, August 4, 2015 7:23 PM
To: user
Subject: Combining Spark Files with saveAsTextFile


What is the best way to make saveAsTextFile save as only a single file?


RE: Combining Spark Files with saveAsTextFile

2015-08-04 Thread Mohammed Guller
Just to further clarify, you can first call coalesce with argument 1 and then 
call saveAsTextFile. For example,

rdd.coalesce(1).saveAsTextFile(...)



Mohammed

From: Mohammed Guller
Sent: Tuesday, August 4, 2015 9:39 PM
To: 'Brandon White'; user
Subject: RE: Combining Spark Files with saveAsTextFile

One option is to use the coalesce method in the RDD class.

Mohammed

From: Brandon White [mailto:bwwintheho...@gmail.com]
Sent: Tuesday, August 4, 2015 7:23 PM
To: user
Subject: Combining Spark Files with saveAsTextFile


What is the best way to make saveAsTextFile save as only a single file?


Spark SQL unable to recognize schema name

2015-08-04 Thread Mohammed Guller
Hi -

I am running the Thrift JDBC/ODBC server (v1.4.1) and encountered a problem 
when querying tables using fully qualified table names (schemaName.tableName). 
The following query works fine from the beeline tool:

SELECT * from test;

However, the following query throws an exception, even though the table “test” 
does exist under the “default” schema:

SELECT * from default.test;

Error: org.apache.spark.sql.AnalysisException: no such table default.test; line 
1 pos 22 (state=,code=0)

Here is the exception trace on the Thrift Server console:

15/08/04 14:27:03 WARN ThriftCLIService: Error executing statement:
org.apache.hive.service.cli.HiveSQLException: 
org.apache.spark.sql.AnalysisException: no such table default.test; line 1 pos 
22
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:206)
at 
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
at 
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
at sun.reflect.GeneratedMethodAccessor65.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
at 
org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
at java.security.AccessController.doPrivileged(Native Method)


Is it a bug in 1.4.1 or am I missing some configuration parameter?

Mohammed



RE: Heatmap with Spark Streaming

2015-07-30 Thread Mohammed Guller
Umesh,
You can create a web service in any of the languages supported by Spark and 
stream the result from this web service to your D3-based client using WebSocket 
or Server-Sent Events.

For example, you can create a web service using Play. This app will integrate 
with Spark Streaming in the back-end. The front-end can be D3-based or any other 
JavaScript app. Play makes it easy to stream data to a web client, so a client 
does not need to continuously poll the back-end server.
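To make the Spark Streaming side of that concrete, here is a minimal sketch (the socket source, batch interval, and output path are all made up) that computes per-batch counts which a Play or similar web tier could pick up and push to the D3 client:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("HeatmapFeed").setMaster("local[2]"))
val ssc = new StreamingContext(sc, Seconds(10))

// Assumed source; in practice this could be Kafka, Flume, etc.
val events = ssc.socketTextStream("localhost", 9999)
val counts = events.map(line => (line, 1L)).reduceByKey(_ + _)

// Each batch lands in its own directory; the web tier reads the latest
// snapshot and streams it to the browser instead of the browser polling Spark.
counts.foreachRDD { (rdd, time) =>
  rdd.saveAsTextFile(s"hdfs:///tmp/heatmap/${time.milliseconds}")
}

ssc.start()
ssc.awaitTermination()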

Mohammed

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Thursday, July 30, 2015 2:07 AM
To: UMESH CHAUDHARY
Cc: user@spark.apache.org
Subject: Re: Heatmap with Spark Streaming

You can integrate it with any language (like php) and use ajax calls to update 
the charts.

Thanks
Best Regards

On Thu, Jul 30, 2015 at 2:11 PM, UMESH CHAUDHARY 
umesh9...@gmail.commailto:umesh9...@gmail.com wrote:
Thanks For the suggestion Akhil!
I looked at https://github.com/mbostock/d3/wiki/Gallery to learn more about d3. 
All the examples described there are on static data. How can we update our heat map 
from updated data if we store it in HBase or MySQL? I mean, do we need to 
query back and forth for it?
Is there any pluggable and quicker component for heat maps with Spark 
Streaming?

On Thu, Jul 30, 2015 at 1:23 PM, Akhil Das 
ak...@sigmoidanalytics.commailto:ak...@sigmoidanalytics.com wrote:
You can easily push data to an intermediate storage from spark streaming (like 
HBase or a SQL/NoSQL DB etc) and then power your dashboards with d3 js.

Thanks
Best Regards

On Tue, Jul 28, 2015 at 12:18 PM, UMESH CHAUDHARY 
umesh9...@gmail.commailto:umesh9...@gmail.com wrote:
I have just started using Spark Streaming and done a few POCs. It is fairly easy 
to implement. I was thinking of presenting the data using some smart graphing and 
dashboarding tools, e.g. Graphite or Grafana, but they don't have heat maps. I 
also looked at Zeppelin (http://zeppelin-project.org/), but was unable to find any 
heat-map functionality. Could you please suggest any data visualization tools 
with heat maps for Spark Streaming.







RE: Need help in SparkSQL

2015-07-22 Thread Mohammed Guller
Parquet

Mohammed

From: Jeetendra Gangele [mailto:gangele...@gmail.com]
Sent: Wednesday, July 22, 2015 5:48 AM
To: user
Subject: Need help in SparkSQL

HI All,

I have data in MongoDB (a few TBs) which I want to migrate to HDFS to do complex 
query analysis on. The queries are things like AND queries involving multiple fields.

So my question is: in which format should I store the data in HDFS so that 
processing will be fast for such queries?


Regards
Jeetendra



RE: Kmeans Labeled Point RDD

2015-07-20 Thread Mohammed Guller
I responded to your question on SO. Let me know if this is what you wanted. 

http://stackoverflow.com/a/31528274/2336943


Mohammed

-Original Message-
From: plazaster [mailto:michaelplaz...@gmail.com] 
Sent: Sunday, July 19, 2015 11:38 PM
To: user@spark.apache.org
Subject: Re: Kmeans Labeled Point RDD

Has there been any progress on this? I am in the same boat.

I posted a similar question to Stack Exchange.

http://stackoverflow.com/questions/31447141/spark-mllib-kmeans-from-dataframe-and-back-again



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kmeans-Labeled-Point-RDD-tp22989p23907.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Data frames select and where clause dependency

2015-07-20 Thread Mohammed Guller
Michael,
How would the Catalyst optimizer optimize this version?
df.filter(df("filter_field") === value).select("field1").show()
Would it still read all the columns in df or would it read only “filter_field” 
and “field1” since only two columns are used (assuming other columns from df 
are not used anywhere else)?

Mohammed

From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, July 17, 2015 1:39 PM
To: Mike Trienis
Cc: user@spark.apache.org
Subject: Re: Data frames select and where clause dependency

Each operation on a dataframe is completely independent and doesn't know what 
operations happened before it.  When you do a selection, you are removing other 
columns from the dataframe and so the filter has nothing to operate on.

On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis 
mike.trie...@orcsol.commailto:mike.trie...@orcsol.com wrote:
I'd like to understand why the where field must exist in the select clause.

For example, the following select statement works fine

  *   df.select("field1", "filter_field").filter(df("filter_field") === 
value).show()
However, the next one fails with the error in operator !Filter 
(filter_field#60 = value);

  *   df.select("field1").filter(df("filter_field") === value).show()
As a work-around, it seems that I can do the following

  *   df.select("field1", "filter_field").filter(df("filter_field") === 
value).drop("filter_field").show()

Thanks, Mike.



RE: Data frames select and where clause dependency

2015-07-20 Thread Mohammed Guller
Thanks, Harish.

Mike – this would be a cleaner version for your use case:
df.filter(df("filter_field") === value).select("field1").show()

Mohammed

From: Harish Butani [mailto:rhbutani.sp...@gmail.com]
Sent: Monday, July 20, 2015 5:37 PM
To: Mohammed Guller
Cc: Michael Armbrust; Mike Trienis; user@spark.apache.org
Subject: Re: Data frames select and where clause dependency

Yes via:  org.apache.spark.sql.catalyst.optimizer.ColumnPruning
See DefaultOptimizer.batches for list of logical rewrites.

You can see the optimized plan by printing: df.queryExecution.optimizedPlan
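For example, a quick check with toy data (column names are made up) that column pruning kicked in:

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("PruningCheck").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Toy DataFrame standing in for df.
val df = Seq(("a", "value", 1), ("b", "other", 2)).toDF("field1", "filter_field", "field2")

val pruned = df.filter(df("filter_field") === "value").select("field1")
// The optimized plan should show a Project over only the two referenced columns.
println(pruned.queryExecution.optimizedPlan)
pruned.show()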

On Mon, Jul 20, 2015 at 5:22 PM, Mohammed Guller 
moham...@glassbeam.commailto:moham...@glassbeam.com wrote:
Michael,
How would the Catalyst optimizer optimize this version?
df.filter(df("filter_field") === value).select("field1").show()
Would it still read all the columns in df or would it read only “filter_field” 
and “field1” since only two columns are used (assuming other columns from df 
are not used anywhere else)?

Mohammed

From: Michael Armbrust 
[mailto:mich...@databricks.commailto:mich...@databricks.com]
Sent: Friday, July 17, 2015 1:39 PM
To: Mike Trienis
Cc: user@spark.apache.orgmailto:user@spark.apache.org
Subject: Re: Data frames select and where clause dependency

Each operation on a dataframe is completely independent and doesn't know what 
operations happened before it.  When you do a selection, you are removing other 
columns from the dataframe and so the filter has nothing to operate on.

On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis 
mike.trie...@orcsol.commailto:mike.trie...@orcsol.com wrote:
I'd like to understand why the where field must exist in the select clause.

For example, the following select statement works fine

  *   df.select("field1", "filter_field").filter(df("filter_field") === 
value).show()
However, the next one fails with the error in operator !Filter 
(filter_field#60 = value);

  *   df.select("field1").filter(df("filter_field") === value).show()
As a work-around, it seems that I can do the following

  *   df.select("field1", "filter_field").filter(df("filter_field") === 
value).drop("filter_field").show()

Thanks, Mike.




RE: Feature Generation On Spark

2015-07-18 Thread Mohammed Guller
Try this (replace ... with the appropriate values for your environment):

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

val sc = new SparkContext(...)
val documents = sc.wholeTextFiles(...)
val tokenized = documents.map{ case (path, document) => (path, 
document.split("\\s+"))}
val numFeatures = 10
val hashingTF = new HashingTF(numFeatures)
val featurized = tokenized.map{ case (path, words) => (path, 
hashingTF.transform(words))}


Mohammed

From: rishikesh thakur [mailto:rishikeshtha...@hotmail.com]
Sent: Friday, July 17, 2015 12:33 AM
To: Mohammed Guller
Subject: Re: Feature Generation On Spark


Thanks, I did look at the example. I am using Spark 1.2. The modules mentioned 
there are not in 1.2, I guess. The import is failing.


Rishi


From: Mohammed Guller moham...@glassbeam.commailto:moham...@glassbeam.com
Sent: Friday, July 10, 2015 2:31 AM
To: rishikesh thakur; ayan guha; Michal Čizmazia
Cc: user
Subject: RE: Feature Generation On Spark


Take a look at the examples here:

https://spark.apache.org/docs/latest/ml-guide.html



Mohammed



From: rishikesh thakur [mailto:rishikeshtha...@hotmail.com]
Sent: Saturday, July 4, 2015 10:49 PM
To: ayan guha; Michal Čizmazia
Cc: user
Subject: RE: Feature Generation On Spark



I have one document per file and each file is to be converted to a feature 
vector. Pretty much like standard feature construction for document 
classification.



Thanks

Rishi



Date: Sun, 5 Jul 2015 01:44:04 +1000
Subject: Re: Feature Generation On Spark
From: guha.a...@gmail.commailto:guha.a...@gmail.com
To: mici...@gmail.commailto:mici...@gmail.com
CC: rishikeshtha...@hotmail.commailto:rishikeshtha...@hotmail.com; 
user@spark.apache.orgmailto:user@spark.apache.org

Do you have one document per file or multiple document in the file?

On 4 Jul 2015 23:38, Michal Čizmazia 
mici...@gmail.commailto:mici...@gmail.com wrote:

Spark Context has a method wholeTextFiles. Is that what you need?

On 4 July 2015 at 07:04, rishikesh 
rishikeshtha...@hotmail.commailto:rishikeshtha...@hotmail.com wrote:
 Hi

 I am new to Spark and am working on document classification. Before model
 fitting I need to do feature generation. Each document is to be converted to
 a feature vector. However I am not sure how to do that. While testing
 locally I have a static list of tokens and when I parse a file I do a lookup
 and increment counters.

 In the case of Spark I can create an RDD which loads all the documents
 however I am not sure if one files goes to one executor or multiple. If the
 file is split then the feature vectors needs to be merged. But I am not able
 to figure out how to do that.

 Thanks
 Rishi



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Feature-Generation-On-Spark-tp23617.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: 
 user-unsubscr...@spark.apache.orgmailto:user-unsubscr...@spark.apache.org
 For additional commands, e-mail: 
 user-h...@spark.apache.orgmailto:user-h...@spark.apache.org


-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.orgmailto:user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.orgmailto:user-h...@spark.apache.org


RE: Any beginner samples for using ML / MLIB to produce a moving average of a (K, iterable[V])

2015-07-15 Thread Mohammed Guller
I could be wrong, but it looks like the only implementation available right now 
is MultivariateOnlineSummarizer.
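For what it's worth, here is a small sketch (device names and latencies are made up) of folding each device's values into a MultivariateOnlineSummarizer to get a per-device mean; a true moving average would instead slide a window over the ordered values:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("DeviceLatencyStats").setMaster("local[*]"))

// Toy stand-in for deviceLatencyMap: (device, latencies in seconds).
val deviceLatencyMap = sc.parallelize(Seq(
  ("device-1", Iterable(0.8, 1.2, 1.0)),
  ("device-2", Iterable(2.5, 2.7))))

// Fold each device's latencies into a summarizer; mean(0) is the average latency.
val meanLatency = deviceLatencyMap.mapValues { latencies =>
  latencies.foldLeft(new MultivariateOnlineSummarizer()) { (summary, v) =>
    summary.add(Vectors.dense(v))
  }.mean(0)
}

meanLatency.collect().foreach(println)
sc.stop()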

Mohammed

From: Nkechi Achara [mailto:nkach...@googlemail.com]
Sent: Wednesday, July 15, 2015 4:31 AM
To: user@spark.apache.org
Subject: Any beginner samples for using ML / MLIB to produce a moving average 
of a (K, iterable[V])

Hi all,

I am trying to get some summary statistics to retrieve the moving average for 
several devices that have an array of latencies in seconds, in this kind of format:

deviceLatencyMap = [K: String, Iterable[V: Double]]

I understand that there is a MultivariateSummary, but as this is a trait, I 
can't understand what to use in its stead.

If you need any more code, please let me know.

Thanks All.

K



RE: Spark performance

2015-07-13 Thread Mohammed Guller
Good points, Michael.

The underlying assumption in my statement is that cost is an issue. If cost is 
not an issue and the only requirement is to query structured data, then there 
are several databases such as Teradata, Exadata, and Vertica that can handle 
4-6 TB of data and outperform Spark.

Mohammed

From: Michael Segel [mailto:msegel_had...@hotmail.com]
Sent: Sunday, July 12, 2015 6:59 AM
To: Mohammed Guller
Cc: David Mitchell; Roman Sokolov; user; Ravisankar Mani
Subject: Re: Spark performance

Not necessarily.

It depends on the use case and what you intend to do with the data.

4-6 TB will easily fit on an SMP box and can be efficiently searched by an 
RDBMS.
Again it depends on what you want to do and how you want to do it.

Informix’s IDS engine with its extensibility could still outperform spark in 
some use cases based on the proper use of indexes and amount of parallelism.

There is a lot of cross over… now had you said 100TB+ on unstructured data… 
things may be different.

Please understand that what would make spark more compelling is the TCO of the 
solution when compared to SMP boxes and software licensing.

Its not that I don’t disagree with your statements, because moving from mssql 
or any small RDBMS to spark … doesn’t make a whole lot of sense.
Just wanted to add that the decision isn’t as cut and dry as some think….

On Jul 11, 2015, at 8:47 AM, Mohammed Guller 
moham...@glassbeam.commailto:moham...@glassbeam.com wrote:

Hi Roman,
Yes, Spark SQL will be a better solution than standard RDBMS databases for 
querying 4-6 TB data. You can pair Spark SQL with HDFS+Parquet to build a 
powerful analytics solution.

Mohammed

From: David Mitchell [mailto:jdavidmitch...@gmail.com]
Sent: Saturday, July 11, 2015 7:10 AM
To: Roman Sokolov
Cc: Mohammed Guller; user; Ravisankar Mani
Subject: Re: Spark performance

You can certainly query over 4 TB of data with Spark.  However, you will get an 
answer in minutes or hours, not in milliseconds or seconds.  OLTP databases are 
used for web applications, and typically return responses in milliseconds.  
Analytic databases tend to operate on large data sets, and return responses in 
seconds, minutes or hours.  When running batch jobs over large data sets, Spark 
can be a replacement for analytic databases like Greenplum or Netezza.



On Sat, Jul 11, 2015 at 8:53 AM, Roman Sokolov 
ole...@gmail.commailto:ole...@gmail.com wrote:
Hello. Had the same question. What if I need to store 4-6 Tb and do queries? 
Can't find any clue in documentation.
Am 11.07.2015 03:28 schrieb Mohammed Guller 
moham...@glassbeam.commailto:moham...@glassbeam.com:
Hi Ravi,
First, Neither Spark nor Spark SQL is a database. Both are compute engines, 
which need to be paired with a storage system. Seconds, they are designed for 
processing large distributed datasets. If you have only 100,000 records or even 
a million records, you don’t need Spark. A RDBMS will perform much better for 
that volume of data.

Mohammed

From: Ravisankar Mani [mailto:rrav...@gmail.commailto:rrav...@gmail.com]
Sent: Friday, July 10, 2015 3:50 AM
To: user@spark.apache.orgmailto:user@spark.apache.org
Subject: Spark performance

Hi everyone,
I have planned to move mssql server to spark?.  I have using around 50,000 to 
1l records.
 The spark performance is slow when compared to mssql server.

What is the best data base(Spark or sql) to store or retrieve data around 
50,000 to 1l records ?
regards,
Ravi




--
### Confidential e-mail, for recipient's (or recipients') eyes only, not for 
distribution. ###



RE: Spark performance

2015-07-11 Thread Mohammed Guller
Hi Roman,
Yes, Spark SQL will be a better solution than standard RDBMS databases for 
querying 4-6 TB data. You can pair Spark SQL with HDFS+Parquet to build a 
powerful analytics solution.
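A minimal sketch of that pairing, assuming the data has already been landed as Parquet on HDFS (path and column names are made up):

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ParquetAnalytics"))
val sqlContext = new SQLContext(sc)

val events = sqlContext.read.parquet("hdfs:///warehouse/events.parquet")
events.registerTempTable("events")

sqlContext.sql(
  "SELECT user_id, COUNT(*) AS cnt FROM events GROUP BY user_id ORDER BY cnt DESC LIMIT 10"
).show()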

Mohammed

From: David Mitchell [mailto:jdavidmitch...@gmail.com]
Sent: Saturday, July 11, 2015 7:10 AM
To: Roman Sokolov
Cc: Mohammed Guller; user; Ravisankar Mani
Subject: Re: Spark performance

You can certainly query over 4 TB of data with Spark.  However, you will get an 
answer in minutes or hours, not in milliseconds or seconds.  OLTP databases are 
used for web applications, and typically return responses in milliseconds.  
Analytic databases tend to operate on large data sets, and return responses in 
seconds, minutes or hours.  When running batch jobs over large data sets, Spark 
can be a replacement for analytic databases like Greenplum or Netezza.



On Sat, Jul 11, 2015 at 8:53 AM, Roman Sokolov 
ole...@gmail.commailto:ole...@gmail.com wrote:

Hello. Had the same question. What if I need to store 4-6 Tb and do queries? 
Can't find any clue in documentation.
Am 11.07.2015 03:28 schrieb Mohammed Guller 
moham...@glassbeam.commailto:moham...@glassbeam.com:
Hi Ravi,
First, Neither Spark nor Spark SQL is a database. Both are compute engines, 
which need to be paired with a storage system. Seconds, they are designed for 
processing large distributed datasets. If you have only 100,000 records or even 
a million records, you don’t need Spark. A RDBMS will perform much better for 
that volume of data.

Mohammed

From: Ravisankar Mani [mailto:rrav...@gmail.commailto:rrav...@gmail.com]
Sent: Friday, July 10, 2015 3:50 AM
To: user@spark.apache.orgmailto:user@spark.apache.org
Subject: Spark performance

Hi everyone,
I have planned to move mssql server to spark?.  I have using around 50,000 to 
1l records.
 The spark performance is slow when compared to mssql server.

What is the best data base(Spark or sql) to store or retrieve data around 
50,000 to 1l records ?
regards,
Ravi




--
### Confidential e-mail, for recipient's (or recipients') eyes only, not for 
distribution. ###


RE: Spark performance

2015-07-10 Thread Mohammed Guller
Hi Ravi,
First, Neither Spark nor Spark SQL is a database. Both are compute engines, 
which need to be paired with a storage system. Seconds, they are designed for 
processing large distributed datasets. If you have only 100,000 records or even 
a million records, you don’t need Spark. A RDBMS will perform much better for 
that volume of data.

Mohammed

From: Ravisankar Mani [mailto:rrav...@gmail.com]
Sent: Friday, July 10, 2015 3:50 AM
To: user@spark.apache.org
Subject: Spark performance

Hi everyone,
I have planned to move from MS SQL Server to Spark. I am using around 50,000 to 
1l records.
 The Spark performance is slow when compared to MS SQL Server.

What is the best database (Spark or SQL) to store or retrieve data around 
50,000 to 1l records?
regards,
Ravi



RE: Feature Generation On Spark

2015-07-09 Thread Mohammed Guller
Take a look at the examples here:
https://spark.apache.org/docs/latest/ml-guide.html

Mohammed

From: rishikesh thakur [mailto:rishikeshtha...@hotmail.com]
Sent: Saturday, July 4, 2015 10:49 PM
To: ayan guha; Michal Čizmazia
Cc: user
Subject: RE: Feature Generation On Spark

I have one document per file and each file is to be converted to a feature 
vector. Pretty much like standard feature construction for document 
classification.

Thanks
Rishi

Date: Sun, 5 Jul 2015 01:44:04 +1000
Subject: Re: Feature Generation On Spark
From: guha.a...@gmail.commailto:guha.a...@gmail.com
To: mici...@gmail.commailto:mici...@gmail.com
CC: rishikeshtha...@hotmail.commailto:rishikeshtha...@hotmail.com; 
user@spark.apache.orgmailto:user@spark.apache.org
Do you have one document per file or multiple document in the file?
On 4 Jul 2015 23:38, Michal Čizmazia 
mici...@gmail.commailto:mici...@gmail.com wrote:
Spark Context has a method wholeTextFiles. Is that what you need?

On 4 July 2015 at 07:04, rishikesh 
rishikeshtha...@hotmail.commailto:rishikeshtha...@hotmail.com wrote:
 Hi

 I am new to Spark and am working on document classification. Before model
 fitting I need to do feature generation. Each document is to be converted to
 a feature vector. However I am not sure how to do that. While testing
 locally I have a static list of tokens and when I parse a file I do a lookup
 and increment counters.

 In the case of Spark I can create an RDD which loads all the documents
 however I am not sure if one files goes to one executor or multiple. If the
 file is split then the feature vectors needs to be merged. But I am not able
 to figure out how to do that.

 Thanks
 Rishi



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Feature-Generation-On-Spark-tp23617.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: 
 user-unsubscr...@spark.apache.orgmailto:user-unsubscr...@spark.apache.org
 For additional commands, e-mail: 
 user-h...@spark.apache.orgmailto:user-h...@spark.apache.org


-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.orgmailto:user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.orgmailto:user-h...@spark.apache.org


RE: How to create a LabeledPoint RDD from a Data Frame

2015-07-06 Thread Mohammed Guller
Have you looked at the new Spark ML library? You can use a DataFrame directly 
with the Spark ML API.

https://spark.apache.org/docs/latest/ml-guide.html
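If you do need the RDD-based MLlib API, a hedged sketch (column names and toy data are made up) of turning each DataFrame row into a LabeledPoint looks like this:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("DfToLabeledPoint").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Toy DataFrame standing in for your data: (label, f1, f2).
val df = Seq((1.0, 0.2, 0.7), (0.0, 0.9, 0.1)).toDF("label", "f1", "f2")

// Map each Row into a LabeledPoint; the result is an RDD[LabeledPoint].
val labeledPoints = df.map { row =>
  LabeledPoint(row.getDouble(0), Vectors.dense(row.getDouble(1), row.getDouble(2)))
}
labeledPoints.collect().foreach(println)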


Mohammed

From: Sourav Mazumder [mailto:sourav.mazumde...@gmail.com]
Sent: Monday, July 6, 2015 10:29 AM
To: user
Subject: How to create a LabeledPoint RDD from a Data Frame

Hi,
I have a Dataframe which I want to use for creating a RandomForest model using 
MLLib.
The RandonForest model needs a RDD with LabeledPoints.
Wondering how do I convert the DataFrame to LabeledPointRDD
Regards,
Sourav


RE: How do we control output part files created by Spark job?

2015-07-06 Thread Mohammed Guller
You could repartition the dataframe before saving it. However, that would 
impact the parallelism of the next jobs that read these files from HDFS.
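A minimal sketch (paths and the partition count are illustrative) of what that looks like:

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ControlPartFiles"))
val sqlContext = new SQLContext(sc)

// Illustrative input; any DataFrame works the same way.
val dataFrame = sqlContext.read.parquet("/input/path/in/hdfs")

// Writing with 8 partitions produces 8 part files; fewer, larger files
// also means less parallelism for the jobs that read them back.
dataFrame.repartition(8).write.format("orc").save("/path/in/hdfs")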

Mohammed


-Original Message-
From: kachau [mailto:umesh.ka...@gmail.com] 
Sent: Monday, July 6, 2015 10:23 AM
To: user@spark.apache.org
Subject: How do we control output part files created by Spark job?

Hi, I have a couple of Spark jobs which process thousands of files every 
day. File sizes may vary from MBs to GBs. After finishing a job I usually save 
using the following code

finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
dataFrame.write.format("orc").save("/path/in/hdfs") //storing as ORC file as of 
Spark 1.4

The Spark job creates plenty of small part files in the final output directory. As far 
as I understand, Spark creates a part file for each partition/task; please correct 
me if I am wrong. How do we control the number of part files Spark creates? Finally, 
I would like to create a Hive table using these parquet/orc directories, and I heard 
Hive is slow when we have a large number of small files.
Please guide me; I am new to Spark. Thanks in advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-do-we-control-output-part-files-created-by-Spark-job-tp23649.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Spark SQL queries hive table, real time ?

2015-07-06 Thread Mohammed Guller
Hi Florian,
It depends on a number of factors. How much data are you querying? Where is the 
data stored (HDD, SSD or DRAM)? What is the file format (Parquet or CSV)?

In theory, it is possible to use Spark SQL for real-time queries, but cost 
increases as the data size grows. If you can store all of your data in memory, 
then you should be able to query it in real-time ☺ At the other extreme, if 
Spark SQL has to read a terabyte of data from spinning disk, there is no way it 
can respond in real-time. To be fair, no software can read a terabyte of data 
from HDD in real-time. Simple laws of physics. You will either have to spread 
out the reads over a large number of disks and read them in parallel, or 
index the data so that your queries don’t have to read a 
terabyte of data from disk.
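As a rough illustration of the in-memory case (table name and path are made up), you can pin a table in memory so repeated queries avoid the disk read entirely:

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("InteractiveSQL"))
val sqlContext = new SQLContext(sc)

// The first query pays the read cost; later queries hit the cached columnar data.
sqlContext.read.parquet("hdfs:///warehouse/events.parquet").registerTempTable("events")
sqlContext.cacheTable("events")

sqlContext.sql("SELECT status, COUNT(*) FROM events GROUP BY status").show()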

Hope that helps.

Mohammed

From: Denny Lee [mailto:denny.g@gmail.com]
Sent: Monday, July 6, 2015 4:21 AM
To: spierki; user@spark.apache.org
Subject: Re: Spark SQL queries hive table, real time ?

Within the context of your question, Spark SQL utilizing the Hive context is 
primarily about very fast queries.  If you want to use real-time queries, I 
would utilize Spark Streaming.  A couple of great resources on this topic 
include Guest Lecture on Spark Streaming in Stanford CME 323: Distributed 
Algorithms and 
Optimizationhttp://www.slideshare.net/tathadas/guest-lecture-on-spark-streaming-in-standford
 and Recipes for Running Spark Streaming Applications in 
Productionhttps://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/
 (from the recent Spark Summit 2015)

HTH!


On Mon, Jul 6, 2015 at 3:23 PM spierki 
florian.spierc...@crisalid.commailto:florian.spierc...@crisalid.com wrote:
Hello,

I'm actually asking myself about the performance of using Spark SQL with Hive
to do real-time analytics.
I know that Hive was created for batch processing, and Spark is used to
do fast queries.

But will using Spark SQL with Hive allow me to do real-time queries? Or will it
just make queries faster but not real time?
Should I use another data warehouse, like HBase?

Thanks in advance for your time and consideration,
Florian



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-queries-hive-table-real-time-tp23642.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.orgmailto:user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.orgmailto:user-h...@spark.apache.org


RE: Spark application with a RESTful API

2015-07-06 Thread Mohammed Guller
It is not a bad idea. Many people use this approach.

Mohammed


-Original Message-
From: Sagi r [mailto:stsa...@gmail.com] 
Sent: Monday, July 6, 2015 1:58 PM
To: user@spark.apache.org
Subject: Spark application with a RESTful API

Hi,

I've been researching spark for a couple of months now, and I strongly believe 
it can solve our problem.

We are developing an application that allows the user to analyze various 
sources of information. We are dealing with non-technical users, so simply 
giving them an interactive shell won't do the trick.

To allow the users to execute queries, I have considered writing a Spark 
application that exposes a RESTful api and runs on our cluster. This 
application will execute the queries on demand on different threads.
We need to serve a few thousand users.

I should mention that I've looked into Spark Job-Server too; it looks promising, 
however it's not quite what we are looking for.

I wanted to hear your input on this solution, and maybe you can suggest a 
better one. 
Thank you.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-application-with-a-RESTful-API-tp23654.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: making dataframe for different types using spark-csv

2015-07-01 Thread Mohammed Guller
Another option is to provide the schema to the load method. One variant of the 
sqlContext.load takes a schema as an input parameter. You can define the schema 
programmatically as shown here:

https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
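For spark-csv specifically, here is a hedged sketch (file path and columns are made up) using the equivalent read.format(...).schema(...) API, so the columns come back typed instead of all strings:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("TypedCsv"))
val sqlContext = new SQLContext(sc)

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("price", DoubleType, nullable = true)))

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(schema)
  .load("hdfs:///data/products.csv")

df.printSchema()   // columns now report int/string/double instead of string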

Mohammed

From: Krishna Sankar [mailto:ksanka...@gmail.com]
Sent: Wednesday, July 1, 2015 3:09 PM
To: Hafiz Mujadid
Cc: user@spark.apache.org
Subject: Re: making dataframe for different types using spark-csv

·  use .cast(...).alias('...') after the DataFrame is read.
·  sql.functions.udf for any domain-specific conversions.
Cheers
k/

On Wed, Jul 1, 2015 at 11:03 AM, Hafiz Mujadid 
hafizmujadi...@gmail.commailto:hafizmujadi...@gmail.com wrote:
Hi experts!


I am using spark-csv to load csv data into a dataframe. By default it makes the
type of each column string. Is there some way to get a dataframe with the actual
types like int, double, etc.?


Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/making-dataframe-for-different-types-using-spark-csv-tp23570.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.orgmailto:user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.orgmailto:user-h...@spark.apache.org



RE: Code review - Spark SQL command-line client for Cassandra

2015-06-22 Thread Mohammed Guller
I haven’t tried using Zeppelin with Spark on Cassandra, so can’t say for sure, 
but it should not be difficult.

Mohammed

From: Matthew Johnson [mailto:matt.john...@algomi.com]
Sent: Monday, June 22, 2015 2:15 AM
To: Mohammed Guller; shahid ashraf
Cc: user@spark.apache.org
Subject: RE: Code review - Spark SQL command-line client for Cassandra

Thanks Mohammed, it’s good to know I’m not alone!

How easy is it to integrate Zeppelin with Spark on Cassandra? It looks like it 
would only support Hadoop out of the box. Is it just a case of dropping the 
Cassandra Connector onto the Spark classpath?

Cheers,
Matthew

From: Mohammed Guller 
[mailto:moham...@glassbeam.commailto:moham...@glassbeam.com]
Sent: 20 June 2015 17:27
To: shahid ashraf
Cc: Matthew Johnson; user@spark.apache.orgmailto:user@spark.apache.org
Subject: RE: Code review - Spark SQL command-line client for Cassandra

It is a simple Play-based web application. It exposes a URI for submitting a 
SQL query. It then executes that query using the CassandraSQLContext provided by 
Spark Cassandra Connector. Since it is web-based, I added an authentication and 
authorization layer to make sure that only users with the right authorization 
can use it.

I am happy to open-source that code if there is interest. Just need to carve 
out some time to clean it up and remove all the other services that this web 
application provides.

Mohammed

From: shahid ashraf [mailto:sha...@trialx.com]
Sent: Saturday, June 20, 2015 6:52 AM
To: Mohammed Guller
Cc: Matthew Johnson; user@spark.apache.orgmailto:user@spark.apache.org
Subject: RE: Code review - Spark SQL command-line client for Cassandra


Hi Mohammad
Can you provide more info about the Service u developed
On Jun 20, 2015 7:59 AM, Mohammed Guller 
moham...@glassbeam.commailto:moham...@glassbeam.com wrote:
Hi Matthew,
It looks fine to me. I have built a similar service that allows a user to 
submit a query from a browser and returns the result in JSON format.

Another alternative is to leave a Spark shell or one of the notebooks (Spark 
Notebook, Zeppelin, etc.) session open and run queries from there. This model 
works only if people give you the queries to execute.

Mohammed

From: Matthew Johnson 
[mailto:matt.john...@algomi.commailto:matt.john...@algomi.com]
Sent: Friday, June 19, 2015 2:20 AM
To: user@spark.apache.orgmailto:user@spark.apache.org
Subject: Code review - Spark SQL command-line client for Cassandra

Hi all,

I have been struggling with Cassandra’s lack of adhoc query support (I know 
this is an anti-pattern of Cassandra, but sometimes management come over and 
ask me to run stuff and it’s impossible to explain that it will take me a while 
when it would take about 10 seconds in MySQL) so I have put together the 
following code snippet that bundles DataStax’s Cassandra Spark connector and 
allows you to submit Spark SQL to it, outputting the results in a text file.

Does anyone spot any obvious flaws in this plan?? (I have a lot more error 
handling etc in my code, but removed it here for brevity)

private void run(String sqlQuery) {
    SparkContext scc = new SparkContext(conf);
    CassandraSQLContext csql = new CassandraSQLContext(scc);
    DataFrame sql = csql.sql(sqlQuery);
    String folderName = "/tmp/output_" + System.currentTimeMillis();
    LOG.info("Attempting to save SQL results in folder: " + folderName);
    sql.rdd().saveAsTextFile(folderName);
    LOG.info("SQL results saved");
}

public static void main(String[] args) {

    String sparkMasterUrl = args[0];
    String sparkHost = args[1];
    String sqlQuery = args[2];

    SparkConf conf = new SparkConf();
    conf.setAppName("Java Spark SQL");
    conf.setMaster(sparkMasterUrl);
    conf.set("spark.cassandra.connection.host", sparkHost);

    JavaSparkSQL app = new JavaSparkSQL(conf);

    app.run(sqlQuery, "printToConsole");
}

I can then submit this to Spark with ‘spark-submit’:


  ./spark-submit --class com.algomi.spark.JavaSparkSQL --master spark://sales3:7077 \
    spark-on-cassandra-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
    spark://sales3:7077 sales3 "select * from mykeyspace.operationlog"

It seems to work pretty well, so I’m pretty happy, but wondering why this isn’t 
common practice (at least I haven’t been able to find much about it on Google) 
– is there something terrible that I’m missing?

Thanks!
Matthew




RE: Code review - Spark SQL command-line client for Cassandra

2015-06-20 Thread Mohammed Guller
It is a simple Play-based web application. It exposes a URI for submitting a 
SQL query. It then executes that query using the CassandraSQLContext provided by 
Spark Cassandra Connector. Since it is web-based, I added an authentication and 
authorization layer to make sure that only users with the right authorization 
can use it.

I am happy to open-source that code if there is interest. Just need to carve 
out some time to clean it up and remove all the other services that this web 
application provides.

Mohammed

From: shahid ashraf [mailto:sha...@trialx.com]
Sent: Saturday, June 20, 2015 6:52 AM
To: Mohammed Guller
Cc: Matthew Johnson; user@spark.apache.org
Subject: RE: Code review - Spark SQL command-line client for Cassandra


Hi Mohammad
Can you provide more info about the Service u developed
On Jun 20, 2015 7:59 AM, Mohammed Guller 
moham...@glassbeam.commailto:moham...@glassbeam.com wrote:
Hi Matthew,
It looks fine to me. I have built a similar service that allows a user to 
submit a query from a browser and returns the result in JSON format.

Another alternative is to leave a Spark shell or one of the notebooks (Spark 
Notebook, Zeppelin, etc.) session open and run queries from there. This model 
works only if people give you the queries to execute.

Mohammed

From: Matthew Johnson 
[mailto:matt.john...@algomi.commailto:matt.john...@algomi.com]
Sent: Friday, June 19, 2015 2:20 AM
To: user@spark.apache.orgmailto:user@spark.apache.org
Subject: Code review - Spark SQL command-line client for Cassandra

Hi all,

I have been struggling with Cassandra’s lack of adhoc query support (I know 
this is an anti-pattern of Cassandra, but sometimes management come over and 
ask me to run stuff and it’s impossible to explain that it will take me a while 
when it would take about 10 seconds in MySQL) so I have put together the 
following code snippet that bundles DataStax’s Cassandra Spark connector and 
allows you to submit Spark SQL to it, outputting the results in a text file.

Does anyone spot any obvious flaws in this plan?? (I have a lot more error 
handling etc in my code, but removed it here for brevity)

private void run(String sqlQuery) {
    SparkContext scc = new SparkContext(conf);
    CassandraSQLContext csql = new CassandraSQLContext(scc);
    DataFrame sql = csql.sql(sqlQuery);
    String folderName = "/tmp/output_" + System.currentTimeMillis();
    LOG.info("Attempting to save SQL results in folder: " + folderName);
    sql.rdd().saveAsTextFile(folderName);
    LOG.info("SQL results saved");
}

public static void main(String[] args) {

    String sparkMasterUrl = args[0];
    String sparkHost = args[1];
    String sqlQuery = args[2];

    SparkConf conf = new SparkConf();
    conf.setAppName("Java Spark SQL");
    conf.setMaster(sparkMasterUrl);
    conf.set("spark.cassandra.connection.host", sparkHost);

    JavaSparkSQL app = new JavaSparkSQL(conf);

    app.run(sqlQuery, "printToConsole");
}

I can then submit this to Spark with ‘spark-submit’:


  ./spark-submit --class com.algomi.spark.JavaSparkSQL --master spark://sales3:7077 \
    spark-on-cassandra-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
    spark://sales3:7077 sales3 "select * from mykeyspace.operationlog"

It seems to work pretty well, so I’m pretty happy, but wondering why this isn’t 
common practice (at least I haven’t been able to find much about it on Google) 
– is there something terrible that I’m missing?

Thanks!
Matthew




RE: Code review - Spark SQL command-line client for Cassandra

2015-06-19 Thread Mohammed Guller
Hi Matthew,
It looks fine to me. I have built a similar service that allows a user to 
submit a query from a browser and returns the result in JSON format.

Another alternative is to leave a Spark shell or one of the notebooks (Spark 
Notebook, Zeppelin, etc.) session open and run queries from there. This model 
works only if people give you the queries to execute.

Mohammed

From: Matthew Johnson [mailto:matt.john...@algomi.com]
Sent: Friday, June 19, 2015 2:20 AM
To: user@spark.apache.org
Subject: Code review - Spark SQL command-line client for Cassandra

Hi all,

I have been struggling with Cassandra’s lack of adhoc query support (I know 
this is an anti-pattern of Cassandra, but sometimes management come over and 
ask me to run stuff and it’s impossible to explain that it will take me a while 
when it would take about 10 seconds in MySQL) so I have put together the 
following code snippet that bundles DataStax’s Cassandra Spark connector and 
allows you to submit Spark SQL to it, outputting the results in a text file.

Does anyone spot any obvious flaws in this plan?? (I have a lot more error 
handling etc in my code, but removed it here for brevity)

private void run(String sqlQuery) {
    SparkContext scc = new SparkContext(conf);
    CassandraSQLContext csql = new CassandraSQLContext(scc);
    DataFrame sql = csql.sql(sqlQuery);
    String folderName = "/tmp/output_" + System.currentTimeMillis();
    LOG.info("Attempting to save SQL results in folder: " + folderName);
    sql.rdd().saveAsTextFile(folderName);
    LOG.info("SQL results saved");
}

public static void main(String[] args) {

    String sparkMasterUrl = args[0];
    String sparkHost = args[1];
    String sqlQuery = args[2];

    SparkConf conf = new SparkConf();
    conf.setAppName("Java Spark SQL");
    conf.setMaster(sparkMasterUrl);
    conf.set("spark.cassandra.connection.host", sparkHost);

    JavaSparkSQL app = new JavaSparkSQL(conf);

    app.run(sqlQuery, "printToConsole");
}

I can then submit this to Spark with ‘spark-submit’:


  ./spark-submit --class com.algomi.spark.JavaSparkSQL --master spark://sales3:7077 \
    spark-on-cassandra-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
    spark://sales3:7077 sales3 "select * from mykeyspace.operationlog"

It seems to work pretty well, so I’m pretty happy, but wondering why this isn’t 
common practice (at least I haven’t been able to find much about it on Google) 
– is there something terrible that I’m missing?

Thanks!
Matthew




RE: Code review - Spark SQL command-line client for Cassandra

2015-06-19 Thread Mohammed Guller
Hi Matthew,
It looks fine to me. I have built a similar service that allows a user to 
submit a query from a browser and returns the result in JSON format.

Another alternative is to leave a Spark shell or one of the notebooks (Spark 
Notebook, Zeppelin, etc.) session open and run queries from there. This model 
works only if people give you the queries to execute.

Mohammed

From: Matthew Johnson [mailto:matt.john...@algomi.com]
Sent: Friday, June 19, 2015 2:21 AM
To: user@cassandra.apache.org
Subject: Code review - Spark SQL command-line client for Cassandra

Hi all,

I have been struggling with Cassandra’s lack of adhoc query support (I know 
this is an anti-pattern of Cassandra, but sometimes management come over and 
ask me to run stuff and it’s impossible to explain that it will take me a while 
when it would take about 10 seconds in MySQL) so I have put together the 
following code snippet that bundles DataStax’s Cassandra Spark connector and 
allows you to submit Spark SQL to it, outputting the results in a text file.

Does anyone spot any obvious flaws in this plan?? (I have a lot more error 
handling etc in my code, but removed it here for brevity)

private void run(String sqlQuery) {
    SparkContext scc = new SparkContext(conf);
    CassandraSQLContext csql = new CassandraSQLContext(scc);
    DataFrame sql = csql.sql(sqlQuery);
    String folderName = "/tmp/output_" + System.currentTimeMillis();
    LOG.info("Attempting to save SQL results in folder: " + folderName);
    sql.rdd().saveAsTextFile(folderName);
    LOG.info("SQL results saved");
}

public static void main(String[] args) {

    String sparkMasterUrl = args[0];
    String sparkHost = args[1];
    String sqlQuery = args[2];

    SparkConf conf = new SparkConf();
    conf.setAppName("Java Spark SQL");
    conf.setMaster(sparkMasterUrl);
    conf.set("spark.cassandra.connection.host", sparkHost);

    JavaSparkSQL app = new JavaSparkSQL(conf);

    app.run(sqlQuery, "printToConsole");
}

I can then submit this to Spark with ‘spark-submit’:


  ./spark-submit --class com.algomi.spark.JavaSparkSQL --master spark://sales3:7077 \
    spark-on-cassandra-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
    spark://sales3:7077 sales3 "select * from mykeyspace.operationlog"

It seems to work pretty well, so I’m pretty happy, but wondering why this isn’t 
common practice (at least I haven’t been able to find much about it on Google) 
– is there something terrible that I’m missing?

Thanks!
Matthew




RE: Lucene index plugin for Apache Cassandra

2015-06-12 Thread Mohammed Guller
The plugin looks cool. Thank you for open sourcing it.

Does it support faceting and other Solr functionality?

Mohammed

From: Andres de la Peña [mailto:adelap...@stratio.com]
Sent: Friday, June 12, 2015 3:43 AM
To: user@cassandra.apache.org
Subject: Re: Lucene index plugin for Apache Cassandra

I really appreciate your interest

Well, the first recommendation is to not use it unless you need it, because a 
properly denormalized Cassandra model is almost always preferable to indexing. 
Lucene indexing is a good option when there is no viable denormalization 
alternative. This is the case for range queries over multiple dimensions, 
full-text search or maybe complex boolean predicates. It's also appropriate for 
Spark/Hadoop jobs mapping a small fraction of the total amount of rows in a 
certain table, if you can pay the cost of indexing.

Lucene indexes run inside C*, so users should closely monitor the amount of 
memory used. It's also a good idea to put the Lucene directory files on a 
separate disk from those used by C* itself. Additionally, you should consider 
that the write throughput of indexed tables will be appreciably reduced, maybe to a 
few thousand rows per second.

It's really hard to estimate the amount of resources needed by the index due to 
the great variety of indexing and querying ways that Lucene offers, so the only 
thing we can suggest is to empirically find the optimal setup for your use case.

2015-06-12 12:00 GMT+02:00 Carlos Rolo 
r...@pythian.commailto:r...@pythian.com:
Seems like an interesting tool!
What operational recommendations would you make to users of this tool (Extra 
hardware capacity, extra metrics to monitor, etc)?

Regards,

Carlos Juzarte Rolo
Cassandra Consultant

Pythian - Love your data

rolo@pythian | Twitter: cjrolo | Linkedin: 
linkedin.com/in/carlosjuzarterolohttp://linkedin.com/in/carlosjuzarterolo
Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649
www.pythian.comhttp://www.pythian.com/

On Fri, Jun 12, 2015 at 11:07 AM, Andres de la Peña 
adelap...@stratio.commailto:adelap...@stratio.com wrote:
Unfortunately, we haven't published any benchmarks yet, but we have plans to 
do so as soon as possible. However, you can expect behavior similar to that 
of Elasticsearch or Solr, with some overhead due to the need to index both 
Cassandra's row key and the partition's token. You can also take a look at 
this 
presentationhttp://planetcassandra.org/video-presentations/vp/cassandra-summit-europe-2014/vd/stratio-advanced-search-and-top-k-queries-in-cassandra/
 to see how cluster distribution is done.

2015-06-12 0:45 GMT+02:00 Ben Bromhead 
b...@instaclustr.commailto:b...@instaclustr.com:
Looks awesome, do you have any examples/benchmarks of using these indexes for 
various cluster sizes e.g. 20 nodes, 60 nodes, 100s+?

On 10 June 2015 at 09:08, Andres de la Peña 
adelap...@stratio.commailto:adelap...@stratio.com wrote:
Hi all,

With the release of Cassandra 2.1.6, Stratio is glad to present its open source 
Lucene-based implementation of C* secondary 
indexeshttps://github.com/Stratio/cassandra-lucene-index as a plugin that can 
be attached to Apache Cassandra. Before the above changes, Lucene index was 
distributed inside a fork of Apache Cassandra, with all the difficulties 
implied. As of now, the fork is discontinued and new users should use the 
recently created plugin, which maintains all the features of Stratio 
Cassandrahttps://github.com/Stratio/stratio-cassandra.

Stratio's Lucene index extends Cassandra’s functionality to provide near 
real-time distributed search engine capabilities such as with ElasticSearch or 
Solr, including full text search capabilities, free multivariable search, 
relevance queries and field-based sorting. Each node indexes its own data, so 
high availability and scalability is guaranteed.

We hope this will be useful to the Apache Cassandra community.

Regards,

--

Andrés de la Peña

http://www.stratio.com/
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // @stratiobd (https://twitter.com/StratioBD)



--

Ben Bromhead

Instaclustr | www.instaclustr.comhttps://www.instaclustr.com/ | 
@instaclustrhttp://twitter.com/instaclustr | (650) 284 9692



--

Andrés de la Peña

http://www.stratio.com/
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // @stratiobd (https://twitter.com/StratioBD)



--





--

Andrés de la Peña

http://www.stratio.com/
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // @stratiobd (https://twitter.com/StratioBD)

